The technological applications of hidden Markov models have been extremely diverse and
successful, including natural language processing, gesture recognition, gene sequencing, and
Kalman filtering of physical measurements. HMMs are highly non-linear statistical models,
and just as linear models are amenable to linear algebraic techniques, non-linear models are
amenable to commutative algebra and algebraic geometry.

This paper closely examines HMMs in which all the hidden random variables are binary.
Its main contributions are (1) a birational parametrization for every such HMM, with an
explicit inverse for recovering the hidden parameters in terms of observables, (2) a semialgebraic
model membership test for every such HMM, and (3) minimal defining equations for the
4-node fully binary model, comprising 21 quadrics and 29 cubics, which were computed using
Gröbner bases in the cumulant coordinates of Sturmfels and Zwiernik. The new model parameters
in (1) are rationally identifiable in the sense of Sullivant, Garcia-Puente, and Spielvogel,
and each model's Zariski closure is therefore a rational projective variety of dimension 5.
Gröbner basis computations for the model and its graph are found to be considerably faster
using these parameters. In the case of two hidden states, item (2) supersedes a previous
algorithm of Schönhuth which is only generically defined, and the defining equations (3) yield
new invariants for HMMs of all lengths $\geq 4$. Such invariants have been used successfully in
model selection problems in phylogenetics, and one can hope for similar applications in the
case of HMMs.

1 Introduction

The present work is motivated primarily by the problems of model selection and parameter 
identifiability, viewed from the perspective of algebraic geometry. By beginning with the 
simplest hidden Markov models (HMMs) — those where all hidden nodes are binary — the 
hope is that eventually a very precise geometric understanding of HMMs can be attained 
that provides insight into these central problems. Indeed, most questions about this case are 
answered by reducing to the case where the visible nodes are also binary. This and related
problems have two main historical lineages: that of hidden Markov models, and that of
algebraic statistics.

Hidden Markov models were developed as statistical models in a series of papers by
Leonard E. Baum and others beginning with Baum and Petrie (1966), after the description
by Stratonovich (1960) of the "forward-backward" algorithm that would be used for
HMM parameter estimation. HMMs have been used extensively in natural language processing
and speech recognition since the development of DRAGON by Baker (1975). As well,
since Krogh, Mian, and Haussler (1994) used HMMs for gene finding in the DNA of E. coli
bacteria, they have had many applications in genomics and biological sequence alignment;
see also (Yoon, 2009). Now, HMM parameter estimation is built into the measurement of so
many kinds of time-series data that it would be gratuitous to enumerate them. However, the
methods of algebraic statistics are not so old, and the algebraic geometry of these models
is far from fully explored. They are hence an important early example for the theory to
investigate.

Algebraic statistics is the application of commutative algebra and algebraic geometry to
the study of statistical models, especially those models involving non-linear relations between
parameters and observables. It was first described at length in the monograph Algebraic
Statistics by Pistone, Riccomagno, and Wynn (2001).^1 Subsequent introductions to the subject
include Algebraic Statistics for Computational Biology by Pachter and Sturmfels (2005),
and Lectures on Algebraic Statistics by Drton, Sturmfels, and Sullivant (2009). Also notable
is Algebraic Geometry and Statistical Learning Theory by Watanabe (2009), for its focus on
the problem of model selection from data.

To the problem of model selection, the algebraic analogue is implicitization, i.e., finding
polynomial defining equations for the Zariski closures of binary hidden Markov models. Such
polynomials are called invariants of the model: if a polynomial $f$ is equal to a constant $c$ at
every point on the model (i.e. does not vary with the model parameters), then we encode
this equation by calling $f - c$ an invariant. Model selection and implicitization are more than
simply analogous; polynomial invariants have been used successfully in model selection by
Casanellas and Fernandez-Sanchez (2006) and Eriksson (2008) for phylogenetic trees.



Invariants have been difficult to classify for hidden Markov models, perhaps due to the
high codimension of the models. Bray and Morton (2005) found many invariants using linear
algebra, but did not exhibit any generating sets of invariants, and in fact their search was
actually for invariants of a model that was slightly modified from the HMM proper. Schönhuth
(2011) found a large family of invariants arising as minors of certain non-abelian Hankel
matrices, and was able to verify that such invariants generate the ideal of the 3-node binary
HMM, the simplest non-degenerate HMM. However, this seemed not to be the case for models
with $n \geq 4$ nodes: Schönhuth reported on a computation of J. Hauenstein which verified
numerically that the 4-node model was not cut out by the Hankel minors.

In Section 3, we will make use of moment and cumulant coordinates as exposited in
(Sturmfels and Zwiernik, 2011), as well as a new coordinate system on the parameter space,
to find explicit defining equations for the 4-node binary HMM. The shortest quadric and
cubic equations are fairly simple; to give the reader a visual sense, they look like this:

\[ g_{2,1} = m_{23} m_{13} - m_2 m_{134} - m_{13} m_{12} + m_1 m_{124} \]

\[ g_{3,1} = m_{12}^3 - 2 m_1 m_{12} m_{123} + m_\emptyset m_{123}^2 + m_1^2 m_{1234} - m_\emptyset m_{12} m_{1234} \]

Here each $m$ is a moment of the observed probability distribution. These equations are not
generated by Schönhuth's Hankel minors, and so provide a finer test for membership to any
binary HMM of length $n \geq 4$ after marginalizing to any 4 consecutive nodes.

To the problem of parameter identifiability, the algebraic analogue is the generic or global
injectivity or finiteness of a map of varieties that parametrizes the model, or in the case of
identifying a single parameter, constancy of the parameter on the fibers of the parameterization.
Sullivant et al. (2010) provide an excellent discussion of this topic in the context of
identifying causal effects; see also (Meshkat, Eisenberg, and DiStefano, 2009) for a striking
application to identification for ODE models in the biosciences.

^1 Pistone et al. attribute their interest in the subject to a seminar paper of Diaconis and Sturmfels (1998),
circulated as a manuscript in 1993, which employed Gröbner bases to construct Markov random walks.

In Section 4, for the purpose of parameter identification in binary hidden Markov models,
we express the parametrization of a binary HMM as the composition of a dominant and
generically finite monomial map $q$ and a birationally invertible map $\psi$. An explicit inverse
to $\psi$ is given, which allows for the easy recovery of hidden parameters in terms of observables.
The components of the monomial map are identifiable combinations in the sense of
Meshkat et al. (2009). The formulae for recovering the hidden parameters are fairly simple
when exhibited in a particular order, corresponding to a particular triangular set of generators
in a union of lexicographic Gröbner bases for the model ideal. To show their simplicity,
the most complicated recovery formula looks like this:

\[ u = \frac{m_1 m_3 - m_2^2 + m_{23} - m_{12}}{2(m_3 - m_2)} \]

As a corollary, in Section 4.3, we find that the fibers of $\phi_n$ are generically zero-dimensional,
consisting of two points which are equivalent under a "hidden label swapping" operation.

Section 5 describes how the parametrization of every fully binary HMM, or "BHMM",
can be factored through a particular 9-dimensional variety called a trace variety, which is
the invariant theory quotient of the space of triples of $2 \times 2$ matrices under a simultaneous
conjugation action by $SL_2$. As a quotient, the trace variety is not defined inside any particular
ambient space. However, its coordinate ring, a trace algebra, was found by Sibirskii (1968) to
be generated by 10 elements, which means we can embed the trace variety in $\mathbb{C}^{10}$. We prove
the main results of Section 4 in the coordinates of this embedding. As a byproduct of this
approach, in Section 5.6 we find that the Zariski closures of all BHMMs with $n \geq 3$
are birational to each other.

Finally, Section 6 explores some applications of our results, including model membership
testing, classification of identifiable parameters, a new grading on HMMs that can be used to
find low-degree invariants, the geometry of equilibrium BHMMs, and HMMs with more than
two visible states.

I would like to thank my advisor, Bernd Sturmfels, and postdoctoral mentor, Shaowei 
Lin, for many helpful conversations and editorial suggestions on this paper. 



2 Definitions 

Important note: In this paper, we will work mostly with BHMMs, that is, HMMs in which both
the hidden and visible nodes are all binary, because, as will be explained in Section 2.3,
all our results will generalize to allow > 2 visible states by reducing to this case.

Throughout, we will be referring to binary hidden Markov processes, distributions, maps, 
models, varieties, and ideals. Each of these terms is used with a distinct meaning, and effort 
is made to keep their usages consistent and separate. 

2.1 Binary Hidden Markov processes and distributions 

A binary hidden Markov process is a statistical process which generates random binary se- 
quences. It is based on the simpler notion of a binary (and not hidden) Markov chain process. 

Definition 2.1. A Binary Hidden Markov process will comprise 5 data: $\pi$, $T$, $E$, and
$(H_t, V_t)$. The pair $(H_t, V_t)$ denotes a jointly random sequence $(H_1, V_1, H_2, V_2, \ldots)$ of binary
variables, also respectively called hidden nodes and visible nodes, with range $\{0, 1\}$. Often a
bound $n$ on the (discrete) time index $t$ is also given. The joint distribution of the nodes is
specified by the following:

• A row vector $\pi = (\pi_0, \pi_1)$, called the initial distribution, which specifies a probability
distribution on the first hidden node $H_1$ by $\Pr(H_1 = i) = \pi_i$;

• A matrix $T = \begin{bmatrix} T_{00} & T_{01} \\ T_{10} & T_{11} \end{bmatrix}$, called the transition matrix, which specifies conditional
"transition" probabilities by the formula $\Pr(H_t = j \mid H_{t-1} = i) = T_{ij}$, read as the probability
of "transitioning from hidden state $i$ to hidden state $j$";^2

• A matrix $E = \begin{bmatrix} E_{00} & E_{01} \\ E_{10} & E_{11} \end{bmatrix}$, called the emission matrix, which specifies conditional
"emission" probabilities by the formula $\Pr(V_t = j \mid H_t = i) = E_{ij}$, read as the probability
that "hidden state $i$ emits the visible state $j$".



To be precise, the parameter vector $\theta = (\pi, T, E)$ determines a probability distribution
on the set of sequences of pairs $((H_1, V_1), \ldots, (H_n, V_n)) \in (\{0,1\}^2)^n$, or if no bound $n$ is
specified, a compatible sequence of such distributions as $n$ grows. In applications, only the
joint distribution on the visible nodes $(V_1, \ldots, V_n) \in \{0,1\}^n$ is observed, and is called the
observed distribution. This distribution is given by marginalizing (summing) over the possible
hidden states of a BHM process:

\[
\begin{aligned}
\Pr(V = v \mid \theta = (\pi, T, E))
  &= \sum_{h \in \{0,1\}^n} \Pr(h, v \mid \pi, T, E)
   = \sum_{h \in \{0,1\}^n} \Pr(h \mid \pi, T)\, \Pr(v \mid h, E) \\
  &= \sum_{h \in \{0,1\}^n} \pi_{h_1} E_{h_1, v_1} \prod_{i=2}^{n} T_{h_{i-1}, h_i} E_{h_i, v_i}
\end{aligned}
\tag{1}
\]
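As a sanity check on (1), the sum over hidden paths can be evaluated by brute force for small $n$. The following Python sketch is only an illustration: the function name `observed_distribution` and the numerical parameter values are assumptions made here, not taken from the paper, and the cost grows like $4^n$.

```python
from itertools import product

def observed_distribution(pi, T, E, n):
    """Return {v: Pr(V = v)} by summing equation (1) over all hidden paths h."""
    dist = {}
    for v in product((0, 1), repeat=n):
        total = 0.0
        for h in product((0, 1), repeat=n):
            # pi_{h_1} E_{h_1, v_1} * prod_{i >= 2} T_{h_{i-1}, h_i} E_{h_i, v_i}
            term = pi[h[0]] * E[h[0]][v[0]]
            for i in range(1, n):
                term *= T[h[i - 1]][h[i]] * E[h[i]][v[i]]
            total += term
        dist[v] = total
    return dist

# Illustrative stochastic parameters (hypothetical values).
pi = [0.3, 0.7]
T = [[0.9, 0.1], [0.2, 0.8]]
E = [[0.6, 0.4], [0.25, 0.75]]
p = observed_distribution(pi, T, E, 3)
```

For longer sequences one would replace the inner enumeration by a forward recursion over the hidden state; the brute-force form is kept here because it mirrors (1) term by term.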



Definition 2.2. A Binary Hidden Markov distribution is a probability distribution on
sequences $v \in \{0,1\}^n$ of jointly random binary variables $(V_1, \ldots, V_n)$ which arises as the
observed distribution of some BHM process according to (1).

As we will see in Section 4.1, different processes $(\pi, T, E, H_t, V_t)$ can give rise to the same
observed distribution on the $V_t$, for example by permuting the labels of the hidden variables,
or by other relations among the parameters.

Those already familiar with Markov models in some form may note that: 

• The data $(\pi, T, H_t)$ alone specify what is ordinarily called a binary Markov chain process
on the nodes $H_t$. In the applications we have in mind, these nodes are unobserved
variables.

• The matrices $T$ and $E$ are assumed to be stationary, meaning that they are not allowed
to vary with the "time index" $t$ of $(H_t, V_t)$.

• The distribution $\pi$ is not assumed to be at equilibrium, i.e. we do not assume that
$\pi T = \pi$. This allows for more diverse applications.

N.B. 2.3. The term "stationary" is sometimes also used for a process that is at equilibrium;
we will reserve the term "stationary" for the constancy of matrices $T$, $E$ over time.



^2 (Schönhuth, 2011) uses $T$ for different matrices, which I will later denote by $P$.

2.2 Binary Hidden Markov maps, models, varieties, and ideals

Statistical processes come in families defined by allowing their parameters to vary, and in 
short, the set of probability distributions that can arise from the processes in a given family 
is called a statistical model. The Zariski closure of such a model in an appropriate complex 
space is an algebraic variety, and the geometry of this variety carries information about the 
purely algebraic properties of the model. 

In a binary hidden Markov process, $\pi$, $T$, and $E$ must be stochastic matrices, i.e. each of
their rows must consist of non-negative reals which sum to 1, since these rows are probability
distributions. We denote by $\Theta_{\mathrm{st}}$ the set of such triples $(\pi, T, E)$, which is isometric to the
5-dimensional cube $(\Delta_1)^5$. We call $\Theta_{\mathrm{st}}$ the space of stochastic parameters. It is helpful to also
consider the larger space of triples $(\pi, T, E)$ where the matrices can have arbitrary complex
entries with row sums of 1. We write $\Theta_{\mathbb{C}}$ for this larger space, which is equal to the complex
Zariski closure of $\Theta_{\mathrm{st}}$, and call it the space of complex parameters.

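We will not simply replace $\Theta_{\mathrm{st}}$ by $\Theta_{\mathbb{C}}$ for convenience, as has sometimes been done in
algebraic phylogenetics. For the ring of polynomial functions on these spaces, we write

\[ \mathbb{C}[\theta] := \mathbb{C}[\pi_j, T_{ij}, E_{ij}] \Big/ \Big\langle 1 = \sum_j \pi_j = \sum_j T_{ij} = \sum_j E_{ij} \ \text{for } i = 0, 1 \Big\rangle \]

so as to make the identification $\Theta_{\mathrm{st}} \subseteq \Theta_{\mathbb{C}} = \operatorname{Spec} \mathbb{C}[\theta]$. Here Spec denotes the spectrum
of a ring; see (Cox, Little, and O'Shea, 2007) for this and other fundamentals of algebraic
geometry.

Now we fix a length $|v| = n$ for our binary sequences $v$, and write

\[
\begin{aligned}
R_{p,n} &:= \mathbb{C}[p_v \mid v \in \{0,1\}^n] & \mathbb{C}_p^{2^n} &:= \operatorname{Spec}(R_{p,n}) \\
\overline{R}_{p,n} &:= R_{p,n} \big/ \big\langle 1 - \textstyle\sum_v p_v \big\rangle & \mathbb{C}_p^{2^n - 1} &:= \operatorname{Spec}(\overline{R}_{p,n}) \\
 & & \mathbb{P}_p^{2^n - 1} &:= \operatorname{Proj}(R_{p,n})
\end{aligned}
\]

We will often have occasion to consider the natural inclusions,

\[ \iota_n : \mathbb{C}_p^{2^n - 1} \hookrightarrow \mathbb{C}_p^{2^n}, \qquad \bar{\iota}_n : \mathbb{C}_p^{2^n - 1} \hookrightarrow \mathbb{P}_p^{2^n - 1} \]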

Convention 2.4. Complex spaces such as $\mathbb{C}^{2^n}$ will usually be decorated with a subscript to
indicate the intended coordinates to be used on that space, like the $p$ in $\mathbb{C}_p^{2^n}$ above. Likewise,
a ring will usually be denoted by $R$ with some subscripts to indicate its generators.

Definition 2.5. For $n \geq 3$,

• The Binary Hidden Markov map or modeling map on $n$ nodes is the map $\phi_{\mathrm{BHMM}(n)}$,
or simply $\phi_n$, given by (1), i.e.

\[ \phi_n : \Theta_{\mathbb{C}} \to \mathbb{C}_p^{2^n - 1}, \qquad
   \phi_n^{\#}(p_v) := \sum_{h \in \{0,1\}^n} \pi_{h_1} E_{h_1, v_1} \prod_{i=2}^{n} T_{h_{i-1}, h_i} E_{h_i, v_i} . \]

The word "model" is also frequently used for the map $\phi_n$. This is a very reasonable
usage of the term, but I reserve "model" for the image of the allowed parameter values:

• $\mathrm{BHMM}(n)$, the Binary Hidden Markov model on $n$ nodes, is the image

\[ \bar{\iota}_n \phi_n(\Theta_{\mathrm{st}}) \subseteq \mathbb{P}_p^{2^n - 1} \]

of the stochastic parameter space $\Theta_{\mathrm{st}}$, i.e., the set of observed distributions which can
arise from some BHM process, considered as a subset of $\mathbb{P}_p^{2^n - 1}$ via $\bar{\iota}_n$. Being the
continuous image of the classically compact cube $\Theta_{\mathrm{st}} \cong \Delta_1^5$, $\mathrm{BHMM}(n)$ is also classically
compact and hence classically closed.

• $\overline{\mathrm{BHMM}}(n)$, the Binary Hidden Markov variety on $n$ nodes, is the Zariski closure
of $\mathrm{BHMM}(n)$, or equivalently the classical closure of $\bar{\iota}_n \phi_n(\Theta_{\mathbb{C}})$, in $\mathbb{P}_p^{2^n - 1}$.

• $I_{\mathrm{BHMM}(n)}$, the Binary Hidden Markov ideal on $n$ nodes, is the set of homogeneous
polynomials which vanish on $\mathrm{BHMM}(n)$, i.e., the homogeneous defining ideal
of $\overline{\mathrm{BHMM}}(n)$. Elements of $I_{\mathrm{BHMM}(n)}$ are called invariants of the model.

In summary, probability distributions arise from processes according to modeling maps, 
models are families of distributions arising from processes of a certain type, and the Zariski 
closure of each model is a variety whose geometry reflects the algebraic properties of the 
model. The ideal of the model is the same as the ideal of the variety: the definition of Zariski 
closure is the largest set which has the same ideal of vanishing polynomials as the model. In 
a rigorous sense (namely, the anti-equivalence of the categories of affine schemes and rings), 
the variety encodes information about the "purely algebraic" properties of the model, i.e. 
properties that can be stated by the vanishing of polynomials. 

The number of polynomials that vanish on any given set is infinite, but by the Hilbert 
Basis theorem, one can always find finitely many polynomials whose vanishing implies the 
vanishing of all the others. This is called a generating set for the ideal. To compute a
generating set for $I_{\mathrm{BHMM}(n)}$, we will need the following proposition:

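Proposition 2.6. The ideal $I_{\mathrm{BHMM}(n)}$ is the homogenization of $\ker(\phi_n^{\#} \circ \iota_n^{\#})$ with respect to
$p_\Sigma := \sum_{|v| = n} p_v$.

Proof. The affine ideal $\ker(\phi_n^{\#} \circ \iota_n^{\#})$ cuts out the Zariski closure $X$ of $\iota_n \circ \phi_n(\Theta_{\mathbb{C}})$ in $\mathbb{C}_p^{2^n}$, and
this closure lies in the hyperplane $\{p_\Sigma = 1\} = \mathbb{C}_p^{2^n - 1}$. Let $X'$ be the projective closure of $X$
in $\mathbb{P}_p^{2^n - 1}$, so that $I(X')$ is the homogenization of $\ker(\phi_n^{\#} \circ \iota_n^{\#})$ with respect to $p_\Sigma$.

The cube $\Theta_{\mathrm{st}}$ is Zariski dense in $\Theta_{\mathbb{C}}$, so $\bar{\iota}_n \circ \phi_n(\Theta_{\mathrm{st}})$ is Zariski dense in $\bar{\iota}_n \circ \phi_n(\Theta_{\mathbb{C}})$,
which is Zariski dense in $X$, which is Zariski dense in $X'$. Therefore $X' = \overline{\mathrm{BHMM}}(n)$, and
$I(X') = I_{\mathrm{BHMM}(n)}$, as required. □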

2.3 HMMs with more visible states via BHMM(n) 

All the results of this paper apply to HMMs with more than two visible states, using the
following trick. Consider HMM(2, k, n), an HMM with 2 hidden states, $k$ visible states
$\alpha_1, \ldots, \alpha_k$, and $n$ (consecutive) visible nodes. Such a hidden Markov process can be specified
by a $2 \times k$ matrix $E$ of emission probabilities, along with a $1 \times 2$ matrix $\pi$ and a $2 \times 2$ matrix
$T$ describing the two-state hidden Markov chain as in (1). For each $\ell \in \{1, \ldots, k\}$, we have a
way to interpret this process as a BHM process by letting $\alpha_\ell = 1$ and $\alpha_i = 0$ for $i \neq \ell$. The
resulting binary emission matrix is

\[ E'(\ell) = \begin{bmatrix} 1 - E_{0\ell} & E_{0\ell} \\ 1 - E_{1\ell} & E_{1\ell} \end{bmatrix}, \]

so as $\ell$ varies, we obtain all the entries $E_{ij}$ as entries of an $E'(\ell)$. We shall remark throughout
when results can be generalized to HMM(2, k, n) using this trick.
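The reduction above amounts to collapsing the emission matrix one visible symbol at a time. A minimal Python sketch, assuming the $2 \times k$ emission matrix is stored as a list of two rows and that symbols are indexed from 0 rather than 1 (the function name `binary_emission` and the sample matrix are hypothetical):

```python
def binary_emission(E, ell):
    """Collapse a 2 x k emission matrix to binary: symbol `ell` -> 1, all others -> 0."""
    return [[1.0 - row[ell], row[ell]] for row in E]

# Illustrative 2 x 3 emission matrix (each row sums to 1).
E = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3]]

# One binary emission matrix E'(ell) per visible symbol.
reductions = [binary_emission(E, ell) for ell in range(3)]
```

Each of the $k$ resulting matrices, together with the unchanged $\pi$ and $T$, specifies a BHM process to which the binary results apply.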
3 Defining equations of BHMM(3) and BHMM(4)



Theorem 3.1. The homogeneous ideal $I_{\mathrm{BHMM}(4)}$ of the binary hidden Markov variety $\overline{\mathrm{BHMM}}(4)$
is minimally generated by 21 homogeneous quadrics and 29 homogeneous cubics.



Since Schönhuth (2011) found numerically that his Hankel minors did not cut out BHMM(4)
even set-theoretically, these equations are genuinely new invariants of the model. Moreover,
they are not only applicable to BHMM(4), because a BHM process of length $n \geq 4$ can
be marginalized to any 4 consecutive hidden-visible node pairs to obtain a BHM process of
length 4. Thus, we have $n - 3$ linear maps from BHMM(n) to BHMM(4), each of which allows
us to write 21 quadrics and 29 cubics which vanish on BHMM(n). Finally, using Section 2.3,
we can even obtain invariants of HMM(2, k, n) via the $k$ different reductions to BHMM(n).
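The marginalization maps just mentioned are easy to write down in probability coordinates. A minimal sketch, assuming the observed distribution is stored as a Python dictionary keyed by binary tuples (the name `marginalize_window` and the uniform placeholder distribution are hypothetical):

```python
from itertools import product

def marginalize_window(p, start, width=4):
    """Sum p over all coordinates outside positions start..start+width-1 (1-based)."""
    out = {w: 0.0 for w in product((0, 1), repeat=width)}
    for v, pv in p.items():
        out[v[start - 1:start - 1 + width]] += pv
    return out

n = 6
# Placeholder distribution on {0,1}^n; any observed distribution could be used instead.
p = {v: 1.0 / 2 ** n for v in product((0, 1), repeat=n)}

# The n - 3 marginals onto 4 consecutive visible nodes.
windows = [marginalize_window(p, s) for s in range(1, n - 2)]
```

Each entry of `windows` is a distribution on $\{0,1\}^4$, so any invariant of the 4-node model can be evaluated on it.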

Our fastest derivation of Theorem 3.1 in Macaulay2 (Grayson and Stillman)
uses the birational parametrization of Section 4, but in only a single step, so we defer the
lengthier discussion of the parametrization until then. Modulo this dependency, the proof is
described in Section 3.3 using moment coordinates (Section 3.1) and cumulant coordinates
(Section 3.2).

In probability coordinates, the generators found for $I_{\mathrm{BHMM}(4)}$ had the following sizes:

• Quadrics $g_{2,1}, \ldots, g_{2,21}$: respectively 8, 8, 12, 14, 16, 21, 24, 24, 26, 26, 28, 32, 32, 41,
42, 43, 43, 44, 45, 72, 72 probability terms.

• Cubics $g_{3,1}, \ldots, g_{3,29}$: respectively 32, 43, 44, 44, 44, 52, 52, 56, 56, 61, 69, 71, 74, 76, 78,
81, 99, 104, 109, 119, 128, 132, 148, 157, 176, 207, 224, 236, 429 probability terms.

As a motivation for introducing moment coordinates, we note here that these generators have
considerably fewer terms when written in terms of moments:

• Quadrics $g_{2,1}, \ldots, g_{2,21}$: respectively 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 10, 10, 10, 17
moment terms.

• Cubics $g_{3,1}, \ldots, g_{3,29}$: respectively 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 10, 10, 10, 10, 10, 12,
12, 13, 14, 16, 18, 21, 27, 35 moment terms.

To give a sense of how these polynomials look in moment coordinates, the shortest quadric
and cubic are

• $g_{2,1} = m_{23} m_{13} - m_2 m_{134} - m_{13} m_{12} + m_1 m_{124}$, and

• $g_{3,1} = m_{12}^3 - 2 m_1 m_{12} m_{123} + m_\emptyset m_{123}^2 + m_1^2 m_{1234} - m_\emptyset m_{12} m_{1234}$.
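One can spot-check numerically that $g_{2,1}$ and $g_{3,1}$ vanish on the model. The sketch below assumes the moments $m_I = \Pr(V_i = 1 \text{ for all } i \in I)$ are computed by a forward-style recursion over the hidden chain, which is a choice made here for brevity and not necessarily the computation used in the paper; the helper names and the random seed are likewise illustrative.

```python
import random

def moment(pi, T, E, I, n):
    """m_I = Pr(V_i = 1 for all i in I), for 1-based positions I inside {1, ..., n}."""
    w = list(pi)                          # weights over the current hidden state
    for t in range(1, n + 1):
        if t > 1:                         # advance the hidden chain by one step
            w = [w[0] * T[0][j] + w[1] * T[1][j] for j in (0, 1)]
        if t in I:                        # require the visible node V_t to equal 1
            w = [w[j] * E[j][1] for j in (0, 1)]
    return w[0] + w[1]

def random_stochastic_row(rng):
    x = rng.random()
    return [x, 1.0 - x]

rng = random.Random(0)                    # arbitrary seed, for reproducibility only
pi = random_stochastic_row(rng)
T = [random_stochastic_row(rng) for _ in range(2)]
E = [random_stochastic_row(rng) for _ in range(2)]

def m(*I):
    return moment(pi, T, E, set(I), 4)

g21 = m(2, 3) * m(1, 3) - m(2) * m(1, 3, 4) - m(1, 3) * m(1, 2) + m(1) * m(1, 2, 4)
g31 = (m(1, 2) ** 3 - 2 * m(1) * m(1, 2) * m(1, 2, 3) + m() * m(1, 2, 3) ** 2
       + m(1) ** 2 * m(1, 2, 3, 4) - m() * m(1, 2) * m(1, 2, 3, 4))
```

Up to floating-point error both `g21` and `g31` should be zero for any choice of stochastic parameters; a value far from zero on observed data would be evidence against membership in the model.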



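Let us compare this ideal with $I_{\mathrm{BHMM}(3)}$, the homogeneous defining ideal of $\overline{\mathrm{BHMM}}(3)$.
Schönhuth (2011) found that $I_{\mathrm{BHMM}(3)}$ is precisely the ideal of $3 \times 3$ minors of the following
matrix:

\[
A_{3,3} := \begin{bmatrix}
p_{000} + p_{001} & p_{000} & p_{100} \\
p_{010} + p_{011} & p_{001} & p_{101} \\
p_{100} + p_{101} & p_{010} & p_{110} \\
p_{110} + p_{111} & p_{011} & p_{111}
\end{bmatrix}
\tag{2}
\]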

Schönhuth defines an analogous matrix $A_{n,3}$ for BHMM(n), but then remarks that J.
Hauenstein has found, using numerical rank deficiency testing (Bates, Hauenstein, Peterson, and Sommese,
2010) with the algebraic geometry package Bertini (Bates, Hauenstein, Sommese, and Wampler),
that $\operatorname{minors}_3(A_{n,3})$ does not cut out BHMM(n) when $n = 4$. In general,
Schönhuth shows that $I_{\mathrm{BHMM}(n)} = (\operatorname{minors}_3(A_{n,3}) : \operatorname{minors}_2(B_{n,2}))$ for a particular $2 \times 3$
matrix $B_{n,2}$, but computing generators for this colon ideal is a costly operation, and so no
generating set for $I_{\mathrm{BHMM}(n)}$ was found for any $n \geq 4$ by this method. Instead, here we will make
use of moment coordinates and cumulant coordinates as exposited in (Sturmfels and Zwiernik,
2011).

3.1 Moment coordinates



Moments are particular linear expressions in probabilities. They can be derived from a
moment generating function as in (Sturmfels and Zwiernik, 2011), but in our case, moments
can be expressed simply by the following rule: we order $\{0,1\}^n$ by strict dominance, i.e. $v \geq w$
iff $v_i \geq w_i$ for all $i$, and then

\[ m_v := \sum_{w \geq v} p_w \in R_{p,n} \tag{3} \]

Since all our variables are binary, with the usual algebraic statistical convention that a "+"
subscript denotes an index to be summed over, we can view the conversion from moments
to probabilities as "replacing zeros by + signs". For example, $m_{10010} = p_{1{+}{+}1{+}}$. The ring
elements $m_v \in R_{p,n}$ provide alternative linear coordinates on $\mathbb{P}_p^{2^n - 1}$ in which it turns out that
some previously intractable BHM computations are simplified and become feasible.

For a more compact notation, a binary string $v$ of length $n$ is the indicator function of
a unique subset $I$ of $[n] = \{1, \ldots, n\}$, so we also write $m_I$ to represent $m_v$. For example,
$m_{0000} = m_\emptyset$, $m_{1000} = m_1$, and $m_{0101} = m_{24}$. From (3) we can see that $m_I$ actually represents
a marginal probability: $m_I = \Pr(V_i = 1 \text{ for all } i \in I)$. Thus, in the context of BHMMs, no
confusion results if we write $m_I$ without specifying the value of $n$. To be precise, if $I \subseteq [n]$
and $I'$ denotes $I$ considered as a subset of $[n']$ for some $n' \geq n$, then

\[ \phi_n^{\#}(m_I) = \phi_{n'}^{\#}(m_{I'}) \tag{4} \]

This can be seen in many ways, for example using the Baum formula for moments (Proposition 5.1)
as explained in Section 5.3.
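Rule (3) and the subset notation translate directly into code. A small sketch, assuming a distribution stored as a dictionary keyed by binary tuples (the function names and the sample distribution are hypothetical):

```python
from itertools import product

def moments_from_probabilities(p, n):
    """m_v = sum of p_w over all w >= v componentwise, as in equation (3)."""
    strings = list(product((0, 1), repeat=n))
    return {
        v: sum(p[w] for w in strings if all(wi >= vi for wi, vi in zip(w, v)))
        for v in strings
    }

def subset_label(v):
    """1-based positions where v has a 1, e.g. (0, 1, 0, 1) -> (2, 4)."""
    return tuple(i + 1 for i, vi in enumerate(v) if vi == 1)

# Illustrative distribution on {0,1}^2.
p = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}
m = moments_from_probabilities(p, 2)
```

Here `m[(1, 0)]` is the marginal probability that the first visible node equals 1, matching the reading of $m_I$ as $\Pr(V_i = 1 \text{ for all } i \in I)$.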

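Just as for probabilities, for moments we define rings and spaces

\[
\begin{aligned}
R_{m,n} &:= \mathbb{C}[m_I \mid I \subseteq [n]] & \mathbb{C}_m^{2^n} &:= \operatorname{Spec}(R_{m,n}) \\
\overline{R}_{m,n} &:= R_{m,n} / \langle 1 - m_\emptyset \rangle & \mathbb{C}_m^{2^n - 1} &:= \operatorname{Spec}(\overline{R}_{m,n}) \\
 & & \mathbb{P}_m^{2^n - 1} &:= \operatorname{Proj}(R_{m,n})
\end{aligned}
\tag{5}
\]

To avoid having notation for too many ring isomorphisms, we adopt:

Convention 3.2. Using (3), we will usually treat $m_I$ as a literal element of $R_{p,n}$, thus
creating literal identifications

\[ R_{m,n} = R_{p,n}, \quad \overline{R}_{m,n} = \overline{R}_{p,n}, \quad \mathbb{C}_m^{2^n} = \mathbb{C}_p^{2^n}, \quad \text{and} \quad \mathbb{C}_m^{2^n - 1} = \mathbb{C}_p^{2^n - 1}. \tag{6} \]

Note that, for example, we obtain natural ring inclusions

\[ R_{m,n} \hookrightarrow R_{m,n'} \]

whenever $n \leq n'$, which respect the BHM maps $\phi_n^{\#}$ because of (4).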
As a first application of moment coordinates, we have

Proposition 3.3. The homogeneous ideal $I_{\mathrm{BHMM}(3)}$ is generated in moment coordinates by
the $3 \times 3$ minors of the matrix

\[
A'_{3,3} := \begin{bmatrix}
m_{000} & m_{000} & m_{100} \\
m_{010} & m_{001} & m_{101} \\
m_{100} & m_{010} & m_{110} \\
m_{110} & m_{011} & m_{111}
\end{bmatrix}
= \begin{bmatrix}
m_\emptyset & m_\emptyset & m_1 \\
m_2 & m_3 & m_{13} \\
m_1 & m_2 & m_{12} \\
m_{12} & m_{23} & m_{123}
\end{bmatrix}
\]

In particular, the projective variety $\overline{\mathrm{BHMM}}(3)$ is cut out by these minors.

Proof. Observe that Schönhuth's matrix $A_{3,3}$ in (2) is equivalent under elementary row/column
operations to $A'_{3,3}$, so $\operatorname{minors}_3 A'_{3,3} = \operatorname{minors}_3 A_{3,3} = I_{\mathrm{BHMM}(3)}$. □



Proposition 3.4. The ideal $I_{\mathrm{BHMM}(n)}$ is the homogenization of $\ker(\phi_n^{\#})$ with respect to $m_\emptyset$.

Proof. From Proposition 2.6 we know that $I_{\mathrm{BHMM}(n)}$ is the homogenization of $\ker(\phi_n^{\#} \circ \iota_n^{\#})$
with respect to $m_\emptyset = \sum_{|v| = n} p_v$. From (5), we can identify $\overline{R}_{m,n}$ with the polynomial subring
of $R_{m,n}$ obtained by omitting $m_\emptyset$, so that $\ker(\phi_n^{\#} \circ \iota_n^{\#}) = \ker(\phi_n^{\#}) + \langle 1 - m_\emptyset \rangle$. Since the
additional generator $1 - m_\emptyset$ homogenizes to 0, $\ker(\phi_n^{\#})$ has the same homogenization as
$\ker(\phi_n^{\#} \circ \iota_n^{\#})$, hence the result. □

3.2 Cumulant coordinates 

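Cumulants are non-linear expressions in moments or probabilities which seem to allow even
faster computations with binary hidden Markov models. Let

\[
\begin{aligned}
R_{k,n} &:= \mathbb{C}[k_I \mid I \subseteq [n]] \\
\overline{R}_{k,n} &:= R_{k,n} / \langle k_\emptyset \rangle \\
\mathbb{C}_k^{2^n - 1} &:= \operatorname{Spec}(\overline{R}_{k,n})
\end{aligned}
\]

where, as with moments, we may freely alternate between writing $k_v$ and writing $k_I$, where
$I$ is the set of positions where 1 occurs in $v$. For building generating functions, let $x_1, \ldots, x_n$
be indeterminates, and write $x^v = x^I$ for $x_1^{v_1} \cdots x_n^{v_n} = \prod_{i \in I} x_i$. Let $J$ be the ideal generated
by all the squares $x_i^2$. Following (Sturmfels and Zwiernik, 2011), we define the moment and
cumulant generating functions, respectively, as

\[ f_m(x) := \sum_{I \subseteq [n]} m_I x^I \in \overline{R}_{m,n}[x]/J, \qquad f_k(x) := \sum_{I \subseteq [n]} k_I x^I \in \overline{R}_{k,n}[x]/J \]

We now define changes of coordinates

\[ \kappa_n : \mathbb{C}_m^{2^n - 1} \to \mathbb{C}_k^{2^n - 1}, \qquad \kappa_n^{-1} : \mathbb{C}_k^{2^n - 1} \to \mathbb{C}_m^{2^n - 1} \]

by the formulae

\[
\begin{aligned}
\kappa_n^{\#}(f_k) &= \log(f_m) = \frac{(f_m - 1)}{1} - \frac{(f_m - 1)^2}{2} + \cdots + (-1)^{n+1} \frac{(f_m - 1)^n}{n} \\
\kappa_n^{-\#}(f_m) &= \exp(f_k) = 1 + \frac{f_k}{1!} + \cdots + \frac{f_k^n}{n!}
\end{aligned}
\tag{7}
\]

That is, we let $\kappa_n^{\#}(k_I)$ be the coefficient of $x^I$ in the Taylor expansion of $\log f_m$ about 1, and
let $\kappa_n^{-\#}(m_I)$ be the coefficient of $x^I$ in the Taylor expansion of $\exp f_k$ about 0. Note that in
the relevant coordinate rings $\overline{R}_{m,n}$ and $\overline{R}_{k,n}$, $m_\emptyset = 1$ and $k_\emptyset = 0$. This is why we only need
to compute the first $n$ terms of each Taylor expansion: the higher terms all vanish modulo
the ideal $J$.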
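The truncated logarithm in (7) can be evaluated with square-free polynomial arithmetic. A minimal sketch, assuming elements of $R[x]/J$ are stored as dictionaries from frozensets of indices to coefficients, and that $m_\emptyset = 1$ (the function names and sample moments are illustrative):

```python
def multiply(f, g):
    """Product modulo J: x^I * x^K is x^(I | K) when I and K are disjoint, else 0."""
    out = {}
    for I, a in f.items():
        for K, b in g.items():
            if not (I & K):
                out[I | K] = out.get(I | K, 0.0) + a * b
    return out

def cumulants_from_moments(m, n):
    """Coefficients of log(f_m), using the first n terms of the series in (7)."""
    g = {I: c for I, c in m.items() if I}     # g = f_m - 1 (no constant term)
    k = {I: 0.0 for I in m}
    power = {frozenset(): 1.0}
    for j in range(1, n + 1):
        power = multiply(power, g)            # power now holds g^j modulo J
        for I, c in power.items():
            k[I] = k.get(I, 0.0) + (-1) ** (j + 1) * c / j
    return k

# Two binary variables with m_1 = 0.7, m_2 = 0.6, m_12 = 0.4 (illustrative values).
m = {frozenset(): 1.0, frozenset({1}): 0.7, frozenset({2}): 0.6, frozenset({1, 2}): 0.4}
k = cumulants_from_moments(m, 2)
```

For two variables the output reduces to the familiar $k_1 = m_1$, $k_2 = m_2$, and $k_{12} = m_{12} - m_1 m_2$, the covariance.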

Proposition 3.5. The expressions $\kappa_n^{\#}(k_I)$ and $\kappa_n^{-\#}(m_I)$, i.e. writing of cumulants in terms
of moments and conversely, do not depend on $n$.

Proof. In (Sturmfels and Zwiernik, 2011), these formulae are re-expressed using Möbius
functions, which do not depend on the generating function description above, and in particular
do not depend on $n$. □

3.3 Deriving $I_{\mathrm{BHMM}(4)}$ in Macaulay2

This section describes the proof of Theorem 3.1 using Macaulay2. These computations were
carried out on a Toshiba Satellite P500 laptop running Ubuntu 10.04, with an Intel Core i7
Q740 1.73 GHz CPU and 8 GB of RAM. In light of Proposition 3.4, we will aim to compute
$\ker(\phi_4^{\#} \circ \iota_4^{\#})$, which can be understood geometrically as the (non-homogeneous) ideal of the
standard affine patch of $\overline{\mathrm{BHMM}}(4)$ where $m_\emptyset = \sum_{|v| = 4} p_v = 1$. To reduce the number of
variables, as in Proposition 3.4, we continue to make the identification

\[ \overline{R}_{m,4} = \mathbb{C}[m_I \mid \emptyset \neq I \subseteq [4]] \subseteq R_{m,4} \]

We begin by providing Macaulay2 with the map $\phi_4^{\#} : \overline{R}_{m,4} \to \mathbb{C}[\theta]$ in moment coordinates
(Section 3.1), because probability coordinates result in longer, higher degree expressions. This
can be done by composing the expression of $\phi_n^{\#}(p_v)$ in Definition 2.5 with the expression of
$m_v = m_I$ in (3), or alternatively using the Baum formula for moments (Proposition 5.1),
which involves many fewer arithmetic operations.

Macaulay2 runs out of memory (8 GB) trying to compute $\ker(\phi_4^{\#})$, and as expected, this
memory runs out even sooner in probability coordinates, so we use cumulant coordinates
instead (Section 3.2). We input

\[ \kappa_4^{\#} : \overline{R}_{k,4} \to \overline{R}_{m,4} \]

using coefficient extraction from (7), and compute the composition $\phi_4^{\#} \circ \kappa_4^{\#}$. Then, it is
possible to compute

\[ I_{k,4} := \ker(\phi_4^{\#} \circ \kappa_4^{\#}) \]

which takes around 1.5 hours. Alternatively, we can compute $I_{k,4}$ using the birational
parameterization $\psi_4$ of Section 4 in place of $\phi_4$, which takes less than 1 second and yields
100 generators for $I_{k,4}$.

Subsequent computations run out of memory with this set of 100 generators, so we must
take some steps to simplify it. Macaulay2's trim command reduces the number of generators
of $I_{k,4}$ to 46 in under 1 second. We then order these 46 generators lexicographically, first
by degree and then by number of terms, and eliminate redundant generators in reverse order,
which takes 19 seconds. The result is an inclusion-minimal, non-homogeneous generating
set for $I_{k,4}$ with 35 generators: 24 quadrics and 11 cubics.

Now we compute $I_{m,4} := \kappa_4^{\#}(I_{k,4}) = \kappa_4^{\#}(\ker(\phi_4^{\#} \circ \kappa_4^{\#})) = \ker(\phi_4^{\#})$, i.e., we push forward
the 35 generators for $I_{k,4}$ under the non-linear ring isomorphism $\kappa_4^{\#}$ to obtain 35 generators
for $I_{m,4} = \ker(\phi_4^{\#})$: 2 quadrics, 7 cubics, 16 quartics, 5 quintics, and 5 sextics. In under 1
second, Macaulay2's trim command computes a new set of 39 generators for $I_{m,4}$ with lower
degrees: 21 quadrics, 14 cubics, and 4 quartics, which turns out to save around 1 hour of
computing time in what follows. These generators have many terms each, and eliminating
redundant generators as in the previous paragraph turns out to be too slow to be worth it
here, taking more than 2 hours, so we omit this step.

Finally, we apply Proposition 3.4 to compute $I_{\mathrm{BHMM}(4)}$ as the homogenization of $I_{m,4}$ with
respect to $m_\emptyset$. In Macaulay2, this is achieved by homogenizing the 39 generators for $I_{m,4}$ with
respect to $m_\emptyset$ and then saturating the ideal they generate with respect to $m_\emptyset$. This saturation
operation takes about 29 minutes, and yields a minimal generating set of 50 polynomials:
21 quadrics and 29 cubics. Since probabilities are linear in moments, their degrees are the
same in probability coordinates. Moreover, since these are homogeneous generators for a
homogeneous ideal, they are minimal in a very strong sense:

Corollary 3.6. Any inclusion-minimal homogeneous generating set for $I_{\mathrm{BHMM}(4)}$ in probability
or moment coordinates must contain exactly 21 quadrics and 29 cubics.

We still do not know a generating set for $I_{\mathrm{BHMM}(5)}$. Macaulay2 runs out of memory (8 GB)
attempting to compute $I_{k,5}$, even using the birational parametrization of Section 4. The
author has also attempted this computation using the tree cumulants of Smith and Zwiernik
in place of cumulants, but again Macaulay2 runs out of memory trying to compute the
first kernel. Presumably the subsequent saturation step would be even more computationally
difficult.



4 Birational parametrization of BHMMs 

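Theorem 4.1 (Birational Parameter Theorem). There is a generically two-to-one, dominant
morphism $q : \Theta_{\mathbb{C}} \to \mathbb{C}^5$ such that, for each $n \geq 3$, the binary hidden Markov map $\phi_n$ factors
uniquely as follows, and each $\psi_n : \mathbb{C}^5 \to \overline{\mathrm{BHMM}}(n)$ has a birational inverse map $\rho_n$:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;q\;} \mathbb{C}^5 \underset{\rho_n}{\overset{\psi_n}{\rightleftarrows}} \overline{\mathrm{BHMM}}(n),
\qquad \phi_n = \psi_n \circ q .
\]

In particular, $\overline{\mathrm{BHMM}}(n)$ is always a rational projective variety of dimension 5, i.e.,
birationally equivalent to $\mathbb{P}^5$.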

Using the reduction of Section 2.3, the same is true if we allow $k \geq 2$ visible states in
the model and replace 5 by $3 + k$. This theorem will be proven in Section 5.6 using trace
algebras and the Baum formula for moments. In the course of this section and Section 5
we will exhibit formulae for $\psi_n$ and their inverses $\rho_n$. The inverse map $\rho_3$ has a number of
practical uses, to be explored in Section 6.

Our first step toward Theorem 4.1 is to re-parametrize $\Theta_{\mathbb{C}}$.

4.1 A linear reparametrization of $\Theta_{\mathbb{C}}$

Since the hidden variables $H_t$ are never observed, there is no change in the final expression of
$p_v$ in Definition 2.5 if we swap the labels $\{0, 1\}$ of all the $H_t$ simultaneously. This swapping
is equivalent to an action of the elementary permutation matrix $\sigma = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$:

\[
\mathrm{sw} : \Theta_{\mathbb{C}} \to \Theta_{\mathbb{C}}, \qquad
\theta = (\pi, T, E) \mapsto (\pi\sigma,\; \sigma^{-1} T \sigma,\; \sigma^{-1} E) \tag{8}
\]

(In our case $\sigma^{-1} = \sigma$, but the form above generalizes to permutations of larger hidden
alphabets.) Hence we have that $\Pr(v \mid \pi, T, E) = \Pr(v \mid \mathrm{sw}(\pi, T, E))$, i.e. $\phi_n = \phi_n \circ \mathrm{sw}$.

We will make essential use of a linear parametrization of $\Theta_{\mathbb{C}}$ in which sw has a simple form.
Our new parameters will be $\eta_0 := (a_0, b, c_0, u, v_0)$, with subscript 0's to be explained shortly.
Although we have already used the letter $v$ at times to represent visible binary strings, we
hope that the context will be clear enough to avoid confusion between these usages. We let

\[
\pi = \frac{1}{2}\begin{pmatrix} 1 - a_0, & 1 + a_0 \end{pmatrix}, \qquad
T = \frac{1}{2}\begin{pmatrix} 1 + b - c_0 & 1 - b + c_0 \\ 1 - b - c_0 & 1 + b + c_0 \end{pmatrix}, \qquad
E = \begin{pmatrix} 1 - u + v_0 & u - v_0 \\ 1 - u - v_0 & u + v_0 \end{pmatrix} \tag{9}
\]

(The rightmost column of $E$ is made intentionally homogeneous in the new parameters.) We
can linearly solve for $\eta_0$ in terms of $\theta$ by $a_0 = \pi_1 - \pi_0$ etc., so in fact $(a_0, b, c_0, u, v_0)$ generate
the parameter ring $\mathbb{C}[\theta]$. In these coordinates, sw acts by

\[
a_0 \mapsto -a_0, \quad b \mapsto b, \quad c_0 \mapsto -c_0, \quad u \mapsto u, \quad v_0 \mapsto -v_0 .
\]

In other words, swapping the signs of the subscripted variables $a_0, c_0, v_0$ has the same effect
as acting on the matrices $\pi, T, E$ by $\sigma$ as in (8), i.e., relabeling the hidden alphabet.
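As a sanity check on the reparametrization (9) and the swap action, the following Python sketch builds $(\pi, T, E)$ from $\eta_0 = (a_0, b, c_0, u, v_0)$ in exact rational arithmetic and compares all eight probabilities $p_v$ for $n = 3$ before and after flipping the signs of $a_0, c_0, v_0$. The function names and the sample point are illustrative assumptions; they are not taken from the paper's Macaulay2 computations.

```python
from fractions import Fraction as F
from itertools import product

def matrices(a0, b, c0, u, v0):
    """Build (pi, T, E) from the linear parameters of equation (9)."""
    half = F(1, 2)
    pi = [half * (1 - a0), half * (1 + a0)]
    T = [[half * (1 + b - c0), half * (1 - b + c0)],
         [half * (1 - b - c0), half * (1 + b + c0)]]
    E = [[1 - u + v0, u - v0],
         [1 - u - v0, u + v0]]
    return pi, T, E

def prob(v, pi, T, E):
    """Pr(V = v), summing over all hidden paths (exponential in len(v))."""
    total = F(0)
    for h in product((0, 1), repeat=len(v)):
        term = pi[h[0]] * E[h[0]][v[0]]
        for t in range(1, len(v)):
            term *= T[h[t - 1]][h[t]] * E[h[t]][v[t]]
        total += term
    return total

def swap(a0, b, c0, u, v0):
    """The action of sw in the coordinates eta_0."""
    return (-a0, b, -c0, u, -v0)

# Hypothetical sample point, chosen only so that all matrix entries are valid.
eta0 = (F(1, 5), F(1, 3), F(1, 4), F(2, 5), F(1, 10))
strings = list(product((0, 1), repeat=3))
p = [prob(v, *matrices(*eta0)) for v in strings]
p_swapped = [prob(v, *matrices(*swap(*eta0))) for v in strings]
```

Comparing `p` with `p_swapped` entry by entry tests $\phi_3 = \phi_3 \circ \mathrm{sw}$ at this one point; it is a spot check, not a proof.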

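4.2 Introducing the birational parameters

Since $\phi_n \circ \mathrm{sw} = \phi_n$, by classical invariant theory the ring map $\phi_n^{\#} : R_{p,n} \to \mathbb{C}[\theta]$ must land
in the subring of invariants $\mathbb{C}[\theta]^{\mathrm{sw}} = \mathbb{C}[b, u, a_0^2, c_0^2, v_0^2, a_0 c_0, a_0 v_0, c_0 v_0]$. However, $\phi_n^{\#}$ in fact
factors through a smaller subring, conveniently generated by 5 elements:

Lemma 4.2 (Parameter Subring Lemma). For all $n \geq 3$, the ring map $\phi_n^{\#}$ lands in the
subring

\[
\mathbb{C}[\eta] := \mathbb{C}[a, b, c, u, v]
\]

of $\mathbb{C}[\theta]$, where $a = a_0 v_0$, $c = c_0 v_0$, $v = v_0^2$.

The proof of this key lemma will be given in Section 5.5 after introducing trace algebras.
To interpret its geometric consequences, write $q^{\#}$ for the subring inclusion

\[
q^{\#} : \mathbb{C}[\eta] \hookrightarrow \mathbb{C}[\theta], \qquad
a \mapsto a_0 v_0, \quad b \mapsto b, \quad c \mapsto c_0 v_0, \quad u \mapsto u, \quad v \mapsto v_0^2,
\]

write $\psi_n^{\#} : R_{p,n} \to \mathbb{C}[\eta]$ for the factorization of $\phi_n^{\#}$ through $q^{\#}$, and write $\Theta'_{\mathbb{C}} := \operatorname{Spec} \mathbb{C}[\eta]$,
so $\Theta'_{\mathbb{C}} \cong \mathbb{C}^5$. The result: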

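Corollary 4.3. The following diagram of dominant maps commutes

\[
\Theta_{\mathbb{C}} \xrightarrow{\;q\;} \Theta'_{\mathbb{C}} \xrightarrow{\;\psi_n\;} \overline{\mathrm{BHMM}}(n),
\qquad \phi_n = \psi_n \circ q,
\]

and $q$ is generically two-to-one.

This corollary in particular implies the first part of the Birational Parameter Theorem
(4.1), by taking $q : \Theta_{\mathbb{C}} \to \Theta'_{\mathbb{C}} \cong \mathbb{C}^5$ as the generically 2 : 1 map.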

Remark 4.4. The map q is only dominant, and not surjective; for example, it misses the 
point (1,0,0,0,0). 

Corollary 4.5. For all $n \geq 3$, $\overline{\mathrm{BHMM}}(n) = \overline{\operatorname{image}(\psi_n)}$.

Proof. Since $q$ is dominant, $\overline{\operatorname{image}(\psi_n)} = \overline{\operatorname{image}(\psi_n \circ q)} = \overline{\operatorname{image}(\phi_n)} =: \overline{\mathrm{BHMM}}(n)$. □

The unique factorization map $\psi_n$ can be computed directly in Macaulay2 for small $n$.
The expressions in moment coordinates are simpler than in probabilities, so we present these
in the following proposition.

Proposition 4.6. The map $\psi_3^{\#}$ is given in moment coordinates by

\[
\begin{aligned}
m_\emptyset = m_{000} &\mapsto 1 \\
m_1 = m_{100} &\mapsto a + u \\
m_2 = m_{010} &\mapsto ab + c + u \\
m_3 = m_{001} &\mapsto ab^2 + bc + c + u \\
m_{12} = m_{110} &\mapsto abu + ac + au + cu + u^2 + bv \\
m_{13} = m_{101} &\mapsto ab^2u + abc + bcu + b^2v + ac + au + cu + u^2 \\
m_{23} = m_{011} &\mapsto ab^2u + abc + abu + bcu + c^2 + 2cu + u^2 + bv \\
m_{123} = m_{111} &\mapsto ab^2u^2 + 2abcu + abu^2 + bcu^2 + b^2uv + ac^2 + 2acu \\
&\qquad + c^2u + au^2 + 2cu^2 + u^3 + abv + bcv + 2buv
\end{aligned}
\]
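The formulas of Proposition 4.6 can be spot-checked numerically. The sketch below, an illustration under assumed names rather than the paper's Macaulay2 code, computes each moment $m_I = \Pr(V_i = 1 \text{ for all } i \in I)$ by brute-force summation over hidden and visible strings, and evaluates the polynomial expressions at $a = a_0 v_0$, $c = c_0 v_0$, $v = v_0^2$ for one hypothetical parameter point.

```python
from fractions import Fraction as F
from itertools import product

def matrices(a0, b, c0, u, v0):
    """(pi, T, E) as in equation (9)."""
    half = F(1, 2)
    pi = [half * (1 - a0), half * (1 + a0)]
    T = [[half * (1 + b - c0), half * (1 - b + c0)],
         [half * (1 - b - c0), half * (1 + b + c0)]]
    E = [[1 - u + v0, u - v0], [1 - u - v0, u + v0]]
    return pi, T, E

def moment(I, pi, T, E, n=3):
    """m_I, summing Pr(V = v, H = h) over all v with v_i = 1 for i in I."""
    total = F(0)
    for v in product((0, 1), repeat=n):
        if not all(v[i - 1] == 1 for i in I):
            continue
        for h in product((0, 1), repeat=n):
            term = pi[h[0]] * E[h[0]][v[0]]
            for t in range(1, n):
                term *= T[h[t - 1]][h[t]] * E[h[t]][v[t]]
            total += term
    return total

def psi3(a, b, c, u, v):
    """The moment formulas of Proposition 4.6, keyed by index set."""
    return {
        (1,): a + u,
        (2,): a*b + c + u,
        (3,): a*b**2 + b*c + c + u,
        (1, 2): a*b*u + a*c + a*u + c*u + u**2 + b*v,
        (1, 3): a*b**2*u + a*b*c + b*c*u + b**2*v + a*c + a*u + c*u + u**2,
        (2, 3): a*b**2*u + a*b*c + a*b*u + b*c*u + c**2 + 2*c*u + u**2 + b*v,
        (1, 2, 3): (a*b**2*u**2 + 2*a*b*c*u + a*b*u**2 + b*c*u**2 + b**2*u*v
                    + a*c**2 + 2*a*c*u + c**2*u + a*u**2 + 2*c*u**2 + u**3
                    + a*b*v + b*c*v + 2*b*u*v),
    }

# Hypothetical sample point (a0, b, c0, u, v0).
a0, b, c0, u, v0 = F(1, 5), F(1, 3), F(1, 4), F(2, 5), F(1, 10)
formula = psi3(a0 * v0, b, c0 * v0, u, v0**2)
brute = {I: moment(I, *matrices(a0, b, c0, u, v0)) for I in formula}
```

The dictionaries `formula` and `brute` can then be compared key by key; the brute-force sum is exponential in $n$ and is only meant for small checks like this one.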

We will eventually prove the Birational Parameter Theorem (4.1) by marginalization to
the case $n = 3$, which we can prove here:

Proposition 4.7. The following triangular set of equations hold on the graph of $\psi_3$, after
clearing denominators, and can thus be used to recover parameters from observed moments
where the denominators are non-zero:

\[
\begin{aligned}
b &= \frac{m_3 - m_2}{m_2 - m_1} \\
u &= \frac{m_1 m_3 - m_2^2 + m_{23} - m_{12}}{2(m_3 - m_2)} \\
a &= m_1 - u \\
c &= a - ba + m_2 - m_1 \\
v &= a^2 - \frac{m_1 m_2 - m_{12}}{b}
\end{aligned}
\]

(This proposition and the following corollary actually hold for all $\phi_n$ with $n \geq 3$, because of
Proposition 5.2, and by Section 2.3 these same formulae can be used to recover parameters
for HMM(2, k, n) when $k \geq 2$ as well.)



Proof. These equations can be checked with direct substitution by hand from Proposition 4.6.
Regarding the derivation, they can be obtained in Macaulay2 by computing two Gröbner bases
of the elimination ideal $I = \langle m_v - \psi_3^{\#}(m_v) \mid v \in \{0, 1\}^3 \rangle$ in Lex monomial
order: once in the ring $R_{m,3}[v, c, a, b, u]$, and once in $R_{m,3}[v, c, u, b, a]$. Each variable occurs
in the leading term of some generator in one of these two bases, with a simple expression
in moments as its leading coefficient. We solve each such generator (set to 0) for the desired
parameter. □
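The triangular system of Proposition 4.7 translates directly into a parameter-recovery routine. The following is a minimal Python sketch, assuming exact rational inputs and hypothetical function names: it maps birational parameters to the five moments that the system uses (via Proposition 4.6) and then solves the equations from top to bottom.

```python
from fractions import Fraction as F

def psi3(a, b, c, u, v):
    """Moments m1, m2, m3, m12, m23 of Proposition 4.6 (those used by Prop. 4.7)."""
    m1 = a + u
    m2 = a*b + c + u
    m3 = a*b**2 + b*c + c + u
    m12 = a*b*u + a*c + a*u + c*u + u**2 + b*v
    m23 = a*b**2*u + a*b*c + a*b*u + b*c*u + c**2 + 2*c*u + u**2 + b*v
    return m1, m2, m3, m12, m23

def recover(m1, m2, m3, m12, m23):
    """Solve the triangular system of Proposition 4.7 top to bottom."""
    if m2 == m1 or m3 == m2:
        raise ZeroDivisionError("a denominator of Proposition 4.7 vanishes")
    b = (m3 - m2) / (m2 - m1)
    u = (m1*m3 - m2**2 + m23 - m12) / (2 * (m3 - m2))
    a = m1 - u
    c = a - b*a + m2 - m1
    v = a**2 - (m1*m2 - m12) / b   # b != 0 because m3 != m2
    return a, b, c, u, v

# Hypothetical sample point (a, b, c, u, v).
eta = (F(1, 50), F(1, 3), F(1, 40), F(2, 5), F(1, 100))
recovered = recover(*psi3(*eta))
```

The guard mirrors the proposition's caveat: the formulas are only usable where $m_2 - m_1$ and $m_3 - m_2$ are non-zero.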

Corollary 4.8. The map $\psi_3 : \mathbb{C}^5 \to \overline{\mathrm{BHMM}}(3)$ has a birational inverse $\rho_3$. The map $\rho_3^{\#}$ on
moment coordinate functions is given by:

\[
\begin{aligned}
a &\mapsto \frac{m_2^2 + m_3 m_1 - 2 m_2 m_1 - m_{23} + m_{12}}{2(m_3 - m_2)}, &
u &\mapsto \frac{-m_2^2 + m_3 m_1 + m_{23} - m_{12}}{2(m_3 - m_2)}, \\
b &\mapsto \frac{m_3 - m_2}{m_2 - m_1}, &
v &\mapsto \frac{\operatorname{num}(v)}{4(m_3 - m_2)^2}, \\
c &\mapsto \frac{\operatorname{num}(c)}{2(m_2 - m_1)(m_3 - m_2)}, & &
\end{aligned}
\]

where

\[
\begin{aligned}
\operatorname{num}(c) &= -m_1 m_2^2 + m_2^2 m_3 + m_1^2 m_3 - m_1 m_3^2 - m_1 m_{12} \\
&\quad + 2 m_2 m_{12} - m_3 m_{12} + m_1 m_{23} - 2 m_2 m_{23} + m_3 m_{23}, \text{ and} \\
\operatorname{num}(v) &= m_2^4 - 2 m_1 m_2^2 m_3 + m_1^2 m_3^2 - 2 m_2^2 m_{12} - 2 m_1 m_3 m_{12} + 4 m_2 m_3 m_{12} \\
&\quad + 4 m_1 m_2 m_{23} - 2 m_2^2 m_{23} - 2 m_1 m_3 m_{23} + m_{12}^2 - 2 m_{12} m_{23} + m_{23}^2 .
\end{aligned}
\]

Proof. This can be derived by substituting the solutions for $u$, $a$, and $b$ in the previous
proposition into the subsequent solutions for $a$, $c$, and $v$. Alternatively, it can be checked by
direct substitution in Macaulay2, i.e., one computes that $\psi_3^{\#} \circ \rho_3^{\#}(\theta) = \theta$ for each birational
parameter $\theta \in \{a, b, c, u, v\}$. □



The expressions in Corollary 4.8 are considerably simpler in moment coordinates than in 
probabilities. Comparing the number of terms, the numerators for a, b, c, u, v respectively 
have sizes 5, 2, 10, 4, and 12 in moment coordinates, versus sizes 22, 4, 56, 22, and 190 in 
probability coordinates. This explains in part why Macaulay2's Gröbner basis computations
execute in moment coordinates with much less time and memory.



4.3 Statistical interpretation of the birational inverse $\rho_3$

It turns out that the factors appearing in the denominators of Corollary 4.8 defining $\rho_3$ have
simple factorizations in terms of the rational and birational parameters:

• $m_3 - m_2$ appears in the denominator of all $\rho_3^{\#}(\theta)$ except $\rho_3^{\#}(b)$, and

\[
m_3 - m_2 \;\overset{\psi_3^{\#}}{\longmapsto}\; (b)(ab - a + c) \;\overset{q^{\#}}{\longmapsto}\; (b)(v_0)(a_0 b - a_0 + c_0)
\]

• $m_2 - m_1$ appears in the denominator of $\rho_3^{\#}(b)$ and $\rho_3^{\#}(c)$, and

\[
m_2 - m_1 \;\overset{\psi_3^{\#}}{\longmapsto}\; ab - a + c \;\overset{q^{\#}}{\longmapsto}\; (v_0)(a_0 b - a_0 + c_0)
\]



Let us pause to reflect on the meaning of these factors. 

• The factor $v_0$ occurs in $\det(E) = 2v_0$, hence $v = v_0^2 = 0$ iff the hidden Markov chain
has "no effect" on the observed variables. The image locus $\phi_3(\{v_0 = 0\})$ can thus
be modeled by a sequence of IID coin flips with distribution $E_0 = E_1 = (1 - u, u)$,
so the BHMM is an unlikely model choice. This is a one-dimensional submodel,
parametrizable by $u \in [0, 1]$, with a regular (everywhere-defined) inverse given simply
by $u = m_1$. Denote this model by BIID(n).

• The factor $b$ occurs in $\det(T) = b$, hence $b = 0$ iff each hidden node has "no effect" on
the subsequent hidden nodes. In this case, the observed process can be modeled as a
sequence of independent coin flips, the first flip having distribution $(1 - \alpha, \alpha) := \pi E$
and subsequent flips being IID having distribution $(1 - \beta, \beta) := T_0 E = T_1 E$. The
image locus $\phi_3(\{b = 0\})$ is hence a two-dimensional submodel, parametrizable by
$(\alpha, \beta) \in [0, 1]^2$, with a regular inverse given by $\alpha = m_1$, $\beta = m_2$. Denote this model
by BINID(n), for "binary independent nearly identically distributed" model, and note
that BINID(n) $\supseteq$ BIID(n) by setting $\alpha = \beta$.

• The factor $a_0 b - a_0 + c_0$ occurs in $\pi T - \pi = \frac{1}{2}(-a_0 b + a_0 - c_0,\; a_0 b - a_0 + c_0)$. Hence
$a_0 b - a_0 + c_0 = 0$ iff $\pi$ is a fixed point of $T$, i.e. the hidden Markov chain is at equilibrium.
We may define the Equilibrium Binary Hidden Markov model by restricting $\phi_n$ to the
locus $\{a_0 b - a_0 + c_0 = 0\}$, which turns out to yield a four-dimensional submodel
for each $n \geq 3$. Denote this submodel by EBHMM(n).

It can be easily shown, with the same methods used here for BHMM(n), that EBHMM(n)
itself has a birational parametrization by $(a_0 v_0, b, u, v_0^2) = (a, b, u, v)$, where $a_0, b \in [-1, 1]$,
$c_0 := a_0(1 - b) \in [|b| - 1, 1 - |b|]$, $v_0 \in [0, 1]$, and $u \in [|v_0|, 1 - |v_0|]$, with an inverse
parametrization given by

\[
b = \frac{m_1^2 - m_{13}}{m_1^2 - m_{12}}, \qquad
u = \frac{2 m_1 m_{12} - m_1 m_{13} - m_{123}}{2(m_1^2 - m_{13})}, \qquad
a = m_1 - u, \qquad
v = \frac{a^2 b - m_1^2 + m_{12}}{b} .
\]

The newly occurring denominators here are $m_1^2 - m_{12} = (b)(a^2 - v) = (b)(v_0)^2(a_0^2 - 1)$ and
$m_1^2 - m_{13} = (b)^2(a^2 - v) = (b)^2(v_0)^2(a_0^2 - 1)$. It is easy to check that the only points of EBHMM(n)
where these expressions vanish are points that lie in BINID(n). Thus, for $n \geq 3$, BHMM(n)
can be stratified as a union of three statistically meaningful submodels

\[
\begin{aligned}
\mathrm{BHMM}(n) = {} & \mathrm{BINID}(n) && \leftarrow \text{2 dimensional} \\
& \cup\; (\mathrm{EBHMM}(n) \setminus \mathrm{BINID}(n)) && \leftarrow \text{4 dimensional} \\
& \cup\; (\mathrm{BHMM}(n) \setminus (\mathrm{EBHMM}(n) \cup \mathrm{BINID}(n))) && \leftarrow \text{5 dimensional}
\end{aligned}
\]

each of which has an everywhere-defined inverse parametrization.
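The inverse parametrization of EBHMM(n) can be exercised in the same way as the one for BHMM(3). The sketch below is illustrative only: it substitutes the equilibrium condition $c = a(1 - b)$ into the moment formulas of Proposition 4.6 and then applies the four inverse formulas stated above to a hypothetical point $(a, b, u, v)$.

```python
from fractions import Fraction as F

def ebhmm_moments(a, b, u, v):
    """m1, m12, m13, m123 from Proposition 4.6 with c = a(1 - b) substituted."""
    c = a * (1 - b)
    m1 = a + u
    m12 = a*b*u + a*c + a*u + c*u + u**2 + b*v
    m13 = a*b**2*u + a*b*c + b*c*u + b**2*v + a*c + a*u + c*u + u**2
    m123 = (a*b**2*u**2 + 2*a*b*c*u + a*b*u**2 + b*c*u**2 + b**2*u*v
            + a*c**2 + 2*a*c*u + c**2*u + a*u**2 + 2*c*u**2 + u**3
            + a*b*v + b*c*v + 2*b*u*v)
    return m1, m12, m13, m123

def ebhmm_recover(m1, m12, m13, m123):
    """Inverse parametrization of EBHMM(n); needs m1^2 != m12 and m1^2 != m13."""
    b = (m1**2 - m13) / (m1**2 - m12)
    u = (2*m1*m12 - m1*m13 - m123) / (2 * (m1**2 - m13))
    a = m1 - u
    v = (a**2 * b - m1**2 + m12) / b
    return a, b, u, v

# Hypothetical sample point (a, b, u, v) with b != 0 and v != a^2.
point = (F(1, 50), F(1, 3), F(2, 5), F(1, 100))
round_trip = ebhmm_recover(*ebhmm_moments(*point))
```

The two denominators used here are exactly the "newly occurring" ones, so the routine is undefined on BINID(n), consistent with the stratification.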

4.4 Computational advantages of moments, cumulants, and 
birational parameters 

Our approach has been to work with moments $m_v$ and cumulants $k_v$ instead of probabilities
$p_v$, and the birational parameters $a, b, c, u, v$ instead of the entries of the matrices $\pi, T, E$. Other
than the theoretical advantage that the model map is generically injective on the birational
parameter space, significant computational gains in Macaulay2 also result from these choices
(see Section 3.3 for laptop specifications):

• Computing $\ker \psi_3^{\#} = \ker \phi_3^{\#}$, the affine defining ideal of BHMM(3), took less than 1
second in Macaulay2 when using the birational parameters, compared to 25 seconds
when using the matrix entries and moments, and 15 minutes when using the matrix
entries and probabilities.

• Computing $\ker \psi_4^{\#} = \ker \phi_4^{\#}$, the affine defining ideal of BHMM(4), took less than 1
second in Macaulay2 when using the birational parameters and cumulant coordinates
(Sturmfels and Zwiernik, 2011), compared to 1.5 hours when using the matrix entries
and cumulant coordinates, and running out of memory (8 GB) when using the matrix
entries and probabilities.



5 Parametrizing BHMMs through a trace variety

In this section, we exhibit a parametrization of every BHMM through a particular trace
variety called $\operatorname{Spec} C_{2,3}$, which itself can be embedded in $\mathbb{C}^{10}$. We use these coordinates
to prove the Birational Parameter Theorem (4.1) and the Parameter Subring Lemma (4.2),
which were stated without proof.

For this, we will define a map $\phi_\infty$ through which all the $\phi_n$ factor, and using a version of
the Baum formula for moments, we factor this map further through $\operatorname{Spec} C_{2,3}$. Then we use
a finite set of 10 generators of the ring $C_{2,3}$ exhibited by (Sibirskii, 1968) to show that the
image of $\phi_\infty^{\#}$ lands in the desired subring $\mathbb{C}[\eta]$, and write $\psi_\infty^{\#}$ for the factorization. Finally,
by marginalizing to the case $n = 3$, we obtain a birational inverse for $\psi_n$ from the map $\rho_3$
given in Corollary 4.8.



5.1 Marginalization maps 

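For each pair of integers $n' \geq n \geq 1$, the marginalization map $\mu_n^{n'} : \mathbb{C}_p^{2^{n'}} \to \mathbb{C}_p^{2^n}$ is given by

\[
\mu_n^{n'\#}(p_v) := \sum_{|w| = n' - n} p_{vw} .
\]

These restrict to maps $\mu_n^{n'} : \Delta_p^{2^{n'}-1} \to \Delta_p^{2^n-1}$, and define rational maps
$\mu_n^{n'} : \mathbb{P}_p^{2^{n'}-1} \dashrightarrow \mathbb{P}_p^{2^n-1}$.

In moment coordinates, these maps are actually coordinate projections: $\mu_n^{n'\#}(m_v) = m_{v\hat{0}}$,
where $\hat{0}$ denotes a sequence of $n' - n$ zeros. In fact, using the subset notation for moments
$m_I$, the corresponding ring maps are literal inclusions: $\mu_n^{n'\#}(m_I) = m_I$. In other words,
$\mu_n^{n'} : \mathbb{C}_m^{2^{n'}} \to \mathbb{C}_m^{2^n}$ is just the map which forgets those $m_I$ where $I \not\subseteq [n]$.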
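To illustrate why marginalization is a coordinate projection in moment coordinates, the Python sketch below marginalizes a distribution on $\{0,1\}^4$ to the first three visible nodes and compares the moments $m_I$, $I \subseteq [3]$, before and after. The distribution is an arbitrary made-up example, not one arising from an HMM, and the helper names are assumptions.

```python
from fractions import Fraction as F
from itertools import product

def marginalize(p, n):
    """Sum out every visible node after the first n (the map mu_n^{n'})."""
    out = {}
    for w, q in p.items():
        out[w[:n]] = out.get(w[:n], 0) + q
    return out

def moment(p, I):
    """m_I = Pr(V_i = 1 for all i in I), with 1-indexed positions."""
    return sum(q for w, q in p.items() if all(w[i - 1] == 1 for i in I))

# Arbitrary illustrative distribution on {0,1}^4: weights 1, 2, ..., 16, normalized.
strings4 = list(product((0, 1), repeat=4))
weights = {w: F(k + 1) for k, w in enumerate(strings4)}
total = sum(weights.values())
p4 = {w: q / total for w, q in weights.items()}
p3 = marginalize(p4, 3)

index_sets = [(1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
same_moments = all(moment(p4, I) == moment(p3, I) for I in index_sets)
```

In probability coordinates each $p_v$ of the marginal is a sum of two coordinates of `p4`, whereas each moment $m_I$ with $I \subseteq [3]$ is literally the same number on both sides.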



5.2 The Baum formula for moments 

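Equation (1) involves $O(2^n)$ addition operations. There is a faster way to compute $\phi_n^{\#}(p_v)$, using
$O(n)$ arithmetic operations, by treating the BHM process as a finitary process (Schönhuth,
2011). We define two new matrices:¹

\[
(P_i)_{jk} := E_{ji} T_{jk} = \Pr(V_t = i \text{ and } H_{t+1} = k \mid H_t = j \text{ and } \pi, E, T),
\]

that is,

\[
P_0 = \begin{pmatrix} T_{00}E_{00} & T_{01}E_{00} \\ T_{10}E_{10} & T_{11}E_{10} \end{pmatrix}
\quad \text{and} \quad
P_1 = \begin{pmatrix} T_{00}E_{01} & T_{01}E_{01} \\ T_{10}E_{11} & T_{11}E_{11} \end{pmatrix} .
\]

Writing $\mathbf{1}$ for the vector $\begin{pmatrix} 1 \\ 1 \end{pmatrix}$ we obtain the matrix expression $\phi_n^{\#}(p_v) = \pi P_{v_1} P_{v_2} \cdots P_{v_n} \mathbf{1}$,
which involves only $4n + 2$ multiplications and $2n + 1$ additions. This is known as the Baum
formula. We can rewrite this formula as a trace product of $2 \times 2$ matrices:

\[
\phi_n^{\#}(p_v) = \operatorname{trace}(\pi P_{v_1} P_{v_2} \cdots P_{v_n} \mathbf{1}) = \operatorname{trace}((\mathbf{1}\pi) P_{v_1} P_{v_2} \cdots P_{v_n}) .
\]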

To create an analogue of this formula in moment coordinates, we let

\[
M_0 := P_0 + P_1 = T, \qquad
M_1 := P_1, \qquad
M_2 := \mathbf{1}\pi = \begin{pmatrix} \pi_0 & \pi_1 \\ \pi_0 & \pi_1 \end{pmatrix} .
\]

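Proposition 5.1 (Baum formula for moments). The binary hidden Markov map $\phi_n$ can be
written in moment coordinates as

\[
\phi_n^{\#}(m_v) = \operatorname{trace}(M_2 M_{v_1} M_{v_2} \cdots M_{v_n}) .
\]

For example, $\phi_5^{\#}(m_{01001}) = \operatorname{trace}(M_2 M_0 M_1 M_0 M_0 M_1)$.

Proof. By our definition of $m_v$ (3), we have

\[
\begin{aligned}
\phi_n^{\#}(m_v) = \sum_{w \geq v} \phi_n^{\#}(p_w)
&= \sum_{w \geq v} \operatorname{trace}\big((\mathbf{1}\pi) P_{w_1} P_{w_2} \cdots P_{w_n}\big) \\
&= \operatorname{trace}\Big((\mathbf{1}\pi) \Big(\sum_{w_1 \geq v_1} P_{w_1}\Big) \cdots \Big(\sum_{w_n \geq v_n} P_{w_n}\Big)\Big) \\
&= \operatorname{trace}(M_2 M_{v_1} M_{v_2} \cdots M_{v_n}) .
\end{aligned}
\]

□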
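The trace formula can be compared against the defining sum $m_v = \sum_{w \geq v} p_w$ on a small case. The following Python sketch, with hypothetical stochastic parameters and assumed helper names, evaluates $\operatorname{trace}(M_2 M_{v_1} \cdots M_{v_n})$ with $n$ matrix products and checks it against an exponential-time brute-force computation for $v = 01001$.

```python
from fractions import Fraction as F
from itertools import product

def matmul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def baum_moment(v, pi, T, E):
    """trace(M2 M_{v1} ... M_{vn}) with M0 = T, M1 = P1, M2 = 1*pi."""
    M0 = T
    M1 = [[E[j][1] * T[j][k] for k in range(2)] for j in range(2)]
    M2 = [list(pi), list(pi)]
    acc = M2
    for bit in v:
        acc = matmul(acc, M1 if bit else M0)
    return acc[0][0] + acc[1][1]

def brute_moment(v, pi, T, E):
    """m_v as the sum of p_w over all w >= v, each p_w summed over hidden paths."""
    n, total = len(v), F(0)
    for w in product((0, 1), repeat=n):
        if any(wi < vi for wi, vi in zip(w, v)):
            continue
        for h in product((0, 1), repeat=n):
            term = pi[h[0]] * E[h[0]][w[0]]
            for t in range(1, n):
                term *= T[h[t - 1]][h[t]] * E[h[t]][w[t]]
            total += term
    return total

# Hypothetical stochastic parameters, used only for illustration.
pi = [F(2, 5), F(3, 5)]
T = [[F(3, 4), F(1, 4)], [F(1, 3), F(2, 3)]]
E = [[F(7, 10), F(3, 10)], [F(1, 2), F(1, 2)]]
v = (0, 1, 0, 0, 1)
fast, slow = baum_moment(v, pi, T, E), brute_moment(v, pi, T, E)
```

Positions with $v_i = 0$ contribute the factor $M_0 = T$ because both emissions are summed out there, which is what makes the moment version as cheap as the probability version.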



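¹ $P$ can be thought of naturally as a $2 \times 2 \times 2$ tensor, but we will not make use of this interpretation.

5.3 Truncation and $\phi_\infty$

Proposition 5.2. The binary hidden Markov maps $\phi_n$ form a directed system of maps under
marginalization, meaning that, for each $n' \geq n \geq 1$, the following diagrams commute:

\[
\mu_n^{n'} \circ \phi_{n'} = \phi_n .
\]

Proof. This can be seen directly from the definition of $\phi_n$ and of $m_v$ in (3). Alternatively,
observe that because $M_0 = T$ is stochastic, $M_0 M_2 = M_2$, so for any sequence $\hat{0}$ of zeros of
length $n' - n$, the Baum formula for moments (Proposition 5.1) and the cyclic invariance of
the trace imply that

\[
\phi_{n'}^{\#}(m_{v\hat{0}}) = \operatorname{trace}(M_2 M_{v_1} \cdots M_{v_n} M_0^{\,n'-n})
= \operatorname{trace}(M_0^{\,n'-n} M_2 M_{v_1} \cdots M_{v_n})
= \operatorname{trace}(M_2 M_{v_1} \cdots M_{v_n}) = \phi_n^{\#}(m_v) . \tag{10}
\]

□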



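Thus, to compute $\phi_n$ for all $n$, it is only necessary to compute those $\phi_n^{\#}(m_{v'})$ where $v'$ ends
in 1. Motivated by this observation, let $R_{m,\infty} := \mathbb{C}[m_{v1} \mid v \in \{0, 1\}^n \text{ for some } n \geq 0] =
\mathbb{C}[m_1, m_{01}, m_{11}, m_{001}, m_{101}, m_{011}, \ldots]$, which in subset index notation is simply

\[
\begin{aligned}
R_{m,\infty} :={} & \mathbb{C}[m_I \mid I \subseteq [n] \text{ for some } n \geq 0] \\
={} & \mathbb{C}[m_1, m_2, m_{12}, m_3, m_{13}, m_{23}, \ldots]
\end{aligned}
\]

Then we define $\phi_\infty : \Theta_{\mathbb{C}} \to \operatorname{Spec} R_{m,\infty}$ and $\phi_\infty^{\#} : R_{m,\infty} \to \mathbb{C}[\theta]$ by the formula

\[
\phi_\infty^{\#}(m_{v1}) := \phi_{\operatorname{length}(v1)}^{\#}(m_{v1}), \quad \text{i.e.} \quad
\phi_\infty^{\#}(m_I) := \phi_{\operatorname{size}(I)}^{\#}(m_I) . \tag{11}
\]

Note that by locating the position of the last 1 in a binary sequence $v' \neq 0 \ldots 0$, we can write $v'$
in the form $v1\hat{0}$ for a unique string $v$ (possibly empty if $v' = 1$), so this map is well-defined. By
the same principle, for each $n$ we can also define a "truncation" map $\tau_n : \operatorname{Spec} R_{m,\infty} \to \mathbb{C}_m^{2^n-1}$
by $\tau_n^{\#}(m_{v1\hat{0}}) := m_{v1}$, which, in subset index notation, is a literal ring inclusion:

\[
\tau_n^{\#}(m_I) := m_I \tag{12}
\]

With this definition, $\phi_n$ factorizes as $\phi_n = \tau_n \circ \phi_\infty$. We can summarize this and
Proposition 5.2 as follows: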

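Proposition 5.3. For all $n' \geq n \geq 1$, the following diagrams commute:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;\phi_\infty\;} \operatorname{Spec} R_{m,\infty} \xrightarrow{\;\tau_{n'}\;} \mathbb{C}_m^{2^{n'}-1} \xrightarrow{\;\mu_n^{n'}\;} \mathbb{C}_m^{2^n-1},
\qquad \tau_n = \mu_n^{n'} \circ \tau_{n'}, \quad \phi_n = \tau_n \circ \phi_\infty,
\]

together with the corresponding diagram of ring maps among $R_{m,n}$, $R_{m,n'}$, $R_{m,\infty}$ and $\mathbb{C}[\theta]$.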

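Remark 5.4. These diagrams exhibit the rings $R_{m,n}$ and maps $\phi_n^{\#}$ as a directed system
under the inclusion maps $\mu_n^{n'\#}$, such that $R_{m,\infty} = \operatorname{colim}_{n \to \infty} R_{m,n}$ and
$\phi_\infty^{\#} = \lim_{n \to \infty} \phi_n^{\#}$.

Now, to prove that $\phi_n$ factors through $q$, we need only show that $\phi_\infty$ does.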



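5.4 Factoring $\phi_\infty$ through a trace variety

Let $X_0, X_1, X_2$ be $2 \times 2$ matrices of indeterminates,

\[
X_0 = \begin{pmatrix} x_{000} & x_{001} \\ x_{010} & x_{011} \end{pmatrix}, \qquad
X_1 = \begin{pmatrix} x_{100} & x_{101} \\ x_{110} & x_{111} \end{pmatrix}, \qquad
X_2 = \begin{pmatrix} x_{200} & x_{201} \\ x_{210} & x_{211} \end{pmatrix},
\]

and following the notation of (Drensky, 2007), $\Omega_{2,3} := \mathbb{C}[\text{entries of } X_0, X_1, X_2]$ denotes the
polynomial ring on the entries of these three $2 \times 2$ matrices. The trace algebra $C_{2,3}$ is
defined as the subring of $\Omega_{2,3}$ generated by the traces of products of these matrices,
$C_{2,3} := \mathbb{C}[\operatorname{trace}(X_{i_1} X_{i_2} \cdots X_{i_r}) \mid r \geq 1] \subseteq \Omega_{2,3}$, and we refer to $\operatorname{Spec} C_{2,3}$ as a trace variety. We write

\[
\nu : \operatorname{Spec} \Omega_{2,3} \to \operatorname{Spec} C_{2,3} \qquad \text{and} \qquad \nu^{\#} : C_{2,3} \hookrightarrow \Omega_{2,3}
\]

for the natural dominant map and corresponding ring inclusion. To relate these varieties to
binary HMMs, we define two new maps $\omega^{\#} : \Omega_{2,3} \to \mathbb{C}[\theta]$ and $\xi^{\#} : R_{m,\infty} \to C_{2,3}$ by

\[
\omega^{\#}(X_i) := M_i \qquad \text{and} \qquad
\xi^{\#}(m_{v1}) := \operatorname{trace}\Big(X_2 \Big(\prod_{i} X_{v_i}\Big) X_1\Big) .
\]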



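Proposition 5.5 (Baum factorization). The ring map $\phi_\infty^{\#}$ factorizes as $\phi_\infty^{\#} = \omega^{\#} \circ \nu^{\#} \circ \xi^{\#}$,
i.e., the following diagram commutes:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;\omega\;} \operatorname{Spec} \Omega_{2,3} \xrightarrow{\;\nu\;} \operatorname{Spec} C_{2,3} \xrightarrow{\;\xi\;} \operatorname{Spec} R_{m,\infty},
\qquad \phi_\infty = \xi \circ \nu \circ \omega .
\]

Proof. This is just a restatement of the Baum formula for moments (Proposition 5.1):

\[
\omega^{\#}(\nu^{\#}(\xi^{\#}(m_{v1}))) = \omega^{\#} \operatorname{trace}\Big(X_2 \Big(\prod_{i} X_{v_i}\Big) X_1\Big)
= \operatorname{trace}\Big(M_2 \Big(\prod_{i} M_{v_i}\Big) M_1\Big) = \phi_\infty^{\#}(m_{v1}) .
\]

□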



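5.5 Proving the Parameter Subring Lemma (4.2)

We begin by seeking a factorization of the map $\omega^{\#} \circ \nu^{\#}$. For this we apply the following
commutative algebra result of Sibirskii on the trace algebras $C_{2,r}$:

Proposition 5.6 (Sibirskii, 1968). The trace algebra $C_{2,r}$ is generated by the elements

\[
\begin{aligned}
&\operatorname{trace}(X_i) && 0 \leq i < r \\
&\operatorname{trace}(X_i X_j) && 0 \leq i \leq j < r \\
&\operatorname{trace}(X_i X_j X_k) && 0 \leq i < j < k < r
\end{aligned}
\]

Corollary 5.7. The algebra $C_{2,3}$ is generated by the 10 elements

\[
\begin{gathered}
\operatorname{trace}(X_0),\ \operatorname{trace}(X_1),\ \operatorname{trace}(X_2), \\
\operatorname{trace}(X_0^2),\ \operatorname{trace}(X_1^2),\ \operatorname{trace}(X_2^2),\ \operatorname{trace}(X_0 X_1),\ \operatorname{trace}(X_0 X_2),\ \operatorname{trace}(X_1 X_2), \\
\operatorname{trace}(X_0 X_1 X_2)
\end{gathered}
\]

Proposition 5.8. The ring map $\omega^{\#} \circ \nu^{\#}$ factors through the inclusion

\[
q^{\#} : \mathbb{C}[\eta] := \mathbb{C}[a, b, c, u, v] \hookrightarrow \mathbb{C}[\theta] := \mathbb{C}[a_0, b, c_0, u, v_0],
\]

i.e. we can write $\omega^{\#} \circ \nu^{\#} = q^{\#} \circ r^{\#}$ so that the following diagram commutes:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;\omega\;} \operatorname{Spec} \Omega_{2,3} \xrightarrow{\;\nu\;} \operatorname{Spec} C_{2,3},
\qquad
\Theta_{\mathbb{C}} \xrightarrow{\;q\;} \Theta'_{\mathbb{C}} \xrightarrow{\;r\;} \operatorname{Spec} C_{2,3},
\qquad \nu \circ \omega = r \circ q .
\]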
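Proof. We apply $\omega^{\#} \circ \nu^{\#}$ to the ten generators of $C_{2,3}$ given in Corollary 5.7 and check that they
land in $\mathbb{C}[\eta]$. Explicitly, we find that:

\[
\begin{aligned}
\operatorname{trace}(M_0) &= b + 1 &
\operatorname{trace}(M_0^2) &= b^2 + 1 \\
\operatorname{trace}(M_1) &= bu + c + u &
\operatorname{trace}(M_1^2) &= b^2u^2 + 2bcu + c^2 + 2cu + u^2 + 2bv \\
\operatorname{trace}(M_2) &= 1 &
\operatorname{trace}(M_2^2) &= 1 \\
\operatorname{trace}(M_0 M_1) &= b^2u + bc + c + u &
\operatorname{trace}(M_0 M_2) &= 1 \\
\operatorname{trace}(M_1 M_2) &= a + u &
\operatorname{trace}(M_0 M_1 M_2) &= ab + c + u
\end{aligned}
\]

□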
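The ten trace identities in this proof are easy to confirm at a sample point. The sketch below, an illustration with assumed names and an arbitrary hypothetical parameter choice, builds $M_0, M_1, M_2$ from equation (9) in exact arithmetic and pairs each trace with the claimed polynomial in $a = a_0 v_0$, $c = c_0 v_0$, $v = v_0^2$.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Product of two 2 x 2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def trace(A):
    return A[0][0] + A[1][1]

# Hypothetical sample point (a0, b, c0, u, v0).
a0, b, c0, u, v0 = F(1, 5), F(1, 3), F(1, 4), F(2, 5), F(1, 10)
a, c, v = a0 * v0, c0 * v0, v0**2

half = F(1, 2)
pi = [half * (1 - a0), half * (1 + a0)]
T = [[half * (1 + b - c0), half * (1 - b + c0)],
     [half * (1 - b - c0), half * (1 + b + c0)]]
e = [u - v0, u + v0]                      # Pr(V_t = 1 | H_t = j)

M0 = T
M1 = [[e[j] * T[j][k] for k in range(2)] for j in range(2)]
M2 = [pi, pi]

checks = {
    "tr(M0)": (trace(M0), b + 1),
    "tr(M1)": (trace(M1), b*u + c + u),
    "tr(M2)": (trace(M2), 1),
    "tr(M0^2)": (trace(matmul(M0, M0)), b**2 + 1),
    "tr(M1^2)": (trace(matmul(M1, M1)),
                 b**2*u**2 + 2*b*c*u + c**2 + 2*c*u + u**2 + 2*b*v),
    "tr(M2^2)": (trace(matmul(M2, M2)), 1),
    "tr(M0 M1)": (trace(matmul(M0, M1)), b**2*u + b*c + c + u),
    "tr(M0 M2)": (trace(matmul(M0, M2)), 1),
    "tr(M1 M2)": (trace(matmul(M1, M2)), a + u),
    "tr(M0 M1 M2)": (trace(matmul(matmul(M0, M1), M2)), a*b + c + u),
}
all_match = all(lhs == rhs for lhs, rhs in checks.values())
```

Note that the right-hand sides involve $a_0, c_0, v_0$ only through $a$, $c$ and $v$, which is the content of the proposition.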



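Now, by letting $\psi_\infty^{\#} := r^{\#} \circ \xi^{\#}$ we may factor the ring map $\phi_\infty^{\#}$ as

\[
\phi_\infty^{\#} = \omega^{\#} \circ \nu^{\#} \circ \xi^{\#} = q^{\#} \circ r^{\#} \circ \xi^{\#} = q^{\#} \circ \psi_\infty^{\#} .
\]

Corollary 5.9. The following diagram commutes:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;\omega\;} \operatorname{Spec} \Omega_{2,3} \xrightarrow{\;\nu\;} \operatorname{Spec} C_{2,3} \xrightarrow{\;\xi\;} \operatorname{Spec} R_{m,\infty},
\qquad
\Theta_{\mathbb{C}} \xrightarrow{\;q\;} \Theta'_{\mathbb{C}} \xrightarrow{\;r\;} \operatorname{Spec} C_{2,3},
\qquad \phi_\infty = \psi_\infty \circ q .
\]

Proof of the Parameter Subring Lemma. Proposition 5.3 and Corollary 5.9 together imply
that the following diagrams commute:

\[
\Theta_{\mathbb{C}} \xrightarrow{\;q\;} \Theta'_{\mathbb{C}} \xrightarrow{\;\psi_\infty\;} \operatorname{Spec} R_{m,\infty} \xrightarrow{\;\tau_n\;} \mathbb{C}_m^{2^n-1},
\qquad \phi_n = \tau_n \circ \psi_\infty \circ q .
\]

In particular, the map $\phi_n^{\#}$ factors through $\mathbb{C}[\eta]$, as required.

□

5.6 Proving the Birational Parameter Theorem (4.1)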

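Recall that Corollary 4.3 implies the first part of the Birational Parameter Theorem (4.1)
by taking

\[
q : \Theta_{\mathbb{C}} \to \Theta'_{\mathbb{C}}
\]

as the generically 2 : 1 map. Thus, it remains to show that the maps

\[
\psi_n : \Theta'_{\mathbb{C}} \to \overline{\mathrm{BHMM}}(n)
\]

have birational inverses $\rho_n$. The inverse map $\rho_3$ was already exhibited in Corollary 4.8, and
we obtain $\rho_n$ by marginalization: let

\[
\rho_n = \rho_3 \circ \mu_3^n .
\]

Let $U \subseteq \Theta'_{\mathbb{C}}$ be the Zariski open set on which $\psi_3$ is an isomorphism with inverse $\rho_3$. Consider
the set $\psi_n(U) \subseteq \overline{\mathrm{BHMM}}(n)$. It is Zariski dense in $\overline{\mathrm{BHMM}}(n)$, and by Chevalley's theorem
(Grothendieck and Dieudonné, 1966, EGA IV, 1.8.4), it is constructible, so it must contain a
dense open set $W' \subseteq \overline{\mathrm{BHMM}}(n)$. Now let $W = \psi_n^{-1}(W')$, so we have $\psi_n(W) = W' \subseteq \psi_n(U)$.

Proposition 5.10. $\rho_n \circ \psi_n = \mathrm{Id}$ on $W$ and $\psi_n \circ \rho_n = \mathrm{Id}$ on $W'$.

Proof. Suppose $\eta \in W$. Then $\rho_n \circ \psi_n(\eta) = \rho_3 \circ \mu_3^n \circ \psi_n(\eta) = \rho_3 \circ \psi_3(\eta) = \eta$ since $\eta \in U$. Now
suppose $p \in W'$, so $p = \psi_n(\eta)$ for some $\eta \in W$. Then, applying Proposition 5.2,

\[
\begin{aligned}
\psi_n \circ \rho_n(p) = \psi_n \circ \rho_n \circ \psi_n(\eta) &= \psi_n \circ \rho_3 \circ \mu_3^n \circ \psi_n(\eta) \\
&= \psi_n \circ \rho_3 \circ \psi_3(\eta) = \psi_n(\eta) = p .
\end{aligned}
\]

□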



This completes the proof of the Birational Parameter Theorem (4.1). In fact we have also
proven the following:

Theorem 5.11. For any $n' \geq n \geq 3$, there is a commutative diagram of dominant maps:

[Commutative diagram of dominant maps ending in $\overline{\mathrm{BHMM}}(n)$.]



6 Applications and future directions 

Besides attempting to compute a set of generators for $I_{\mathrm{BHMM}(5)}$, there are many other
questions to be answered about HMMs that can be approached immediately with the techniques
of this paper.

6.1 A nonnegative distribution in $\overline{\mathrm{BHMM}}(3)$ but not BHMM(3)

It turns out that not all of the probability distributions (non-negative real points) of $\overline{\mathrm{BHMM}}(n)$
lie in the model BHMM(n). In other words, $\overline{\mathrm{BHMM}}(n) \cap \Delta_p^{2^n-1} \neq \mathrm{BHMM}(n)$, so the model
must be cut out by some non-trivial inequalities inside the simplex. To illustrate this, the
following real point $\theta$ of $\Theta_{\mathbb{C}}$ does not lie in $\Theta_{st}$, but maps under $\phi_3$ to a point $p$ of $\Delta_p^7$:

\[
\theta = (\pi, T, E), \qquad T = E = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix} \tag{13}
\]

Moreover, the analysis of Section 4.3 reveals that the fiber $\phi_3^{-1}(p)$ consists only of the
point $\theta$ and the "swapped" point

\[
\theta' = (\pi', T', E'), \qquad
T' = \begin{pmatrix} 3/4 & 1/4 \\ 1/4 & 3/4 \end{pmatrix}, \quad
E' = \begin{pmatrix} 1/4 & 3/4 \\ 3/4 & 1/4 \end{pmatrix} \tag{14}
\]

which is also not in $\Theta_{st}$. Hence the image point $p = \phi_3(\theta) = \phi_3(\theta')$ is a non-negative point of
$\overline{\mathrm{BHMM}}(3)$ that does not lie in BHMM(3).



6.2 A semialgebraic model membership test 



In light of the fact that not every nonnegative distribution in $\overline{\mathrm{BHMM}}(n)$ is in BHMM(n),
the defining equations of $\overline{\mathrm{BHMM}}(n)$ are not sufficient to test a probability distribution for
membership to the model. Using the method of Section 2.3, membership to HMM(2, k, n)
can be tested by reducing to the case $k = 2$ to recover the parameters.

So, suppose we are given a distribution $p \in \Delta_p^{2^n-1}$ and asked to determine whether
$p \in \mathrm{BHMM}(n)$. The following procedure yields either

(1) a proof by contradiction that $p \notin \mathrm{BHMM}(n)$,

(2) a parameter vector $\theta \in \Theta_{st}$ such that $\phi_n(\theta) = p \in \mathrm{BHMM}(n)$, or

(3) a reduction of the question to whether $p$ lies in one of the lower-dimensional submodels
of BHMM(n) discussed in Section 4.3.

How to proceed from (3) is essentially the same as what follows, using the birational
parametrizations of the respective submodels given in Section 4.3.

To begin, we let $p' = \mu_3^n(p) \in \Delta_p^7$, i.e. we marginalize $p$ to the distribution $p'$ it induces
on the first three visible nodes. Note that if $p \in \mathrm{BHMM}(n)$ then $p' \in \mathrm{BHMM}(3)$. Observing
the moments $m_I$ of $p'$, if any denominators in the formulae of Corollary 4.8 vanish, then we
end in case (3).

Otherwise, we let $(a, b, c, u, v) = \psi_3^{-1}(p')$, choose $v_0$ to be either square root of $v$, and let
$a_0 = a/v_0$, $c_0 = c/v_0$. If $p$ were due to some BHM process, then by Theorem 5.11 these
would be its parameters, up to a simultaneous sign change of $(a_0, c_0, v_0)$. With this in mind,
we define $\theta = (\pi, T, E)$ using (9). If $(\pi, T, E)$ are not non-negative stochastic matrices, then
$p \notin \mathrm{BHMM}(n)$ and we end in case (1). If they are, we compute $p'' = \phi_n(\theta)$, and if $p = p''$
then we end in case (2). Otherwise $p$ must not have been in BHMM(n), so we end in case
(1).
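The procedure above can be sketched in Python as follows. This is a floating-point illustration rather than an exact semialgebraic test: the tolerance `tol`, the return labels and all function names are assumptions made for the example, and the recovery step uses the triangular formulas of Proposition 4.7, which agree with Corollary 4.8.

```python
import math
from itertools import product

def matrices(a0, b, c0, u, v0):
    """(pi, T, E) as in equation (9)."""
    pi = [(1 - a0) / 2, (1 + a0) / 2]
    T = [[(1 + b - c0) / 2, (1 - b + c0) / 2],
         [(1 - b - c0) / 2, (1 + b + c0) / 2]]
    E = [[1 - u + v0, u - v0], [1 - u - v0, u + v0]]
    return pi, T, E

def hmm_distribution(pi, T, E, n):
    """Joint distribution of n visible bits via the forward recursion."""
    dist = {}
    for v in product((0, 1), repeat=n):
        alpha = [pi[h] * E[h][v[0]] for h in (0, 1)]
        for t in range(1, n):
            alpha = [sum(alpha[g] * T[g][h] for g in (0, 1)) * E[h][v[t]]
                     for h in (0, 1)]
        dist[v] = alpha[0] + alpha[1]
    return dist

def membership_test(p, n, tol=1e-9):
    """Return ('no', None), ('yes', theta) or ('submodel', None): cases (1)-(3)."""
    def m(*I):  # moments of the marginal on the first three visible nodes
        return sum(q for w, q in p.items() if all(w[i - 1] == 1 for i in I))
    m1, m2, m3, m12, m23 = m(1), m(2), m(3), m(1, 2), m(2, 3)
    if abs(m2 - m1) < tol or abs(m3 - m2) < tol:
        return "submodel", None                      # case (3)
    b = (m3 - m2) / (m2 - m1)
    u = (m1 * m3 - m2**2 + m23 - m12) / (2 * (m3 - m2))
    a = m1 - u
    c = a - b * a + m2 - m1
    v = a**2 - (m1 * m2 - m12) / b
    if v <= 0:
        # v = 0 would force m2 = m1 (excluded above); v < 0 admits no real v0.
        return "no", None                            # case (1)
    v0 = math.sqrt(v)
    theta = (a / v0, b, c / v0, u, v0)
    pi, T, E = matrices(*theta)
    entries = pi + T[0] + T[1] + E[0] + E[1]
    if any(x < -tol or x > 1 + tol for x in entries):
        return "no", None                            # case (1)
    q = hmm_distribution(pi, T, E, n)
    if all(abs(q[w] - p[w]) < tol for w in p):
        return "yes", theta                          # case (2)
    return "no", None                                # case (1)
```

Since only the positive square root of $v$ is tried, a returned `theta` is determined up to the swap of Section 4.1. For a rigorous decision one would replace the floating-point comparisons by exact arithmetic.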

Note that since all the criteria in this test are algebraic equalities and inequalities, this
procedure implicitly describes a semialgebraic characterization of BHMM(n) for all $n \geq 3$.

6.3 Identifiability of parameters

By a rational map on a possibly non-algebraic subset $\Theta \subseteq \mathbb{C}^k$, we mean any rational map on
the Zariski closure of $\Theta$, which will necessarily be defined as a function on a Zariski dense
open subset of $\Theta$. We define polynomial maps on $\Theta$ similarly.

Let $\phi : \Theta \to \mathbb{C}^n$ be an algebraic statistical model, where as usual we assume $\Theta \subseteq \mathbb{C}^k$
is Zariski dense, and therefore Zariski irreducible. A (rational) parameter of the model is
any rational map $s : \Theta \dashrightarrow \mathbb{C}$. Such parameters form a field, $K \cong \operatorname{Frac}(\mathbb{C}^k)$. In applications
such as (Meshkat, Eisenberg, and DiStefano, 2009), it is important to know to what extent a
parameter can be identified from observational data alone. In other words, given $\phi(\theta)$, what
can we say about $s(\theta)$? This leads to several different notions of parameter identifiability, as
discussed by Sullivant, Garcia-Puente, and Spielvogel (2010).

Definition 6.1. We say that a rational parameter $s \in K$ is

• (set-theoretically) identifiable if $s = \sigma \circ \phi$ for some set-theoretic function $\sigma : \phi(\Theta) \to \mathbb{C}$.
In other words, for all $\theta, \theta' \in \Theta$, if $\phi(\theta) = \phi(\theta')$ then $s(\theta) = s(\theta')$.

• rationally identifiable if $s = \sigma \circ \phi$ for some rational map $\sigma : \phi(\Theta) \dashrightarrow \mathbb{C}$ (this notion is
used without a name by Sullivant et al. (2010)).

• generically identifiable if there is a (relatively) Zariski dense open subset $U \subseteq \Theta$ such
that $s|_U = \sigma \circ \phi|_U$ for some set-theoretic function $\sigma : \phi(U) \to \mathbb{C}$.

• algebraically identifiable if there is a polynomial function $g(p, q) := \sum_i g_i(p_1, \ldots, p_n)\, q^i$
on $\phi(\Theta) \times \mathbb{C}$ of degree $d > 0$ in $q$ (so that $g_d$ is not identically 0 on $\phi(\Theta)$) such that
$g(\phi(\theta), s(\theta)) = 0$ for all $\theta \in \Theta$ (and hence all $\theta \in \mathbb{C}^k$).

Question 6.2. What combinations of BHM parameters are rationally identifiable, generically 
identifiable, or algebraically identifiable? 

To answer this question we introduce a lemma on algebraic statistical models in general: 

Lemma 6.3. For any algebraic statistical model $\phi$ as above, the sets $K_{ri}$, $K_{gi}$, and $K_{ai}$ of
rationally, generically, and algebraically identifiable parameters, respectively, are all fields.

Proof. Since $\Theta$ is Zariski irreducible, so is $\phi(\Theta)$. Hence the set of rational maps on $\phi(\Theta)$ is
simply the fraction field of its Zariski closure (an irreducible variety), and $K_{ri}$ is the image
of this field under $\phi^{\#}$, which must be a field.

For $K_{gi}$, the crux is to show that if $s, s' \in K_{gi}$ and $s \neq 0$ then $s'/s \in K_{gi}$. Let $U \subseteq \Theta$
and $\sigma : \phi(U) \to \mathbb{C}$ be as in the definition for $s$, and likewise $U' \subseteq \Theta$ and $\sigma' : \phi(U') \to \mathbb{C}$ for
$s'$. Let $U'' = \{\theta \in U \cap U' \mid s(\theta) \neq 0\}$, which, being an intersection of three Zariski dense
open subsets of $\Theta$, is a dense open. We have $\sigma \neq 0$ on $\phi(U'') \subseteq \phi(U) \cap \phi(U')$, so we can let
$\sigma'' = \sigma'/\sigma : \phi(U'') \to \mathbb{C}$, and then $\sigma'' \circ \phi = s'/s$ on $U''$, so $s'/s \in K_{gi}$. Thus $K_{gi}$ is stable under
division, and simpler arguments show it is stable under $+$, $-$, and $\cdot$, so it is a field.

Finally, $K_{ai}$ is expressly the relative algebraic closure in $K$ of the image under $\phi^{\#}$ of the
coordinate ring of $\phi(\Theta)$, which is therefore a field. □

Proposition 6.4. For any algebraic statistical model as above, $K_{ri} \subseteq K_{gi} \subseteq K_{ai} \subseteq K$.

Proof. This is now just a restatement of Proposition 3 in (Sullivant et al., 2010). □



Now, the answer to our identifiability question for BHM parameters can be given easily in
the coordinates of Section 4. Here $\phi$ is the BHM map $\phi_n$. The field $K_{ri}$ is simply the image
$q^{\#}(\operatorname{Frac}(\Theta'_{\mathbb{C}}))$ because by Theorem 4.1

\[
\psi_n^{\#} : \operatorname{Frac}(\overline{\mathrm{BHMM}}(n)) \to \operatorname{Frac}(\Theta'_{\mathbb{C}})
\]

is an isomorphism. Hence the rationally identifiable parameters are precisely the field of
rational functions in $(a, b, c, u, v) = (a_0 v_0, b, c_0 v_0, u, v_0^2)$ (see (9) for the meanings of these
parameters). Since $K$ is a quadratic field extension of $K_{ri}$ given by adjoining $v_0 = \sqrt{v}$, and
$K_{ai}$ is the algebraic closure of $K_{ri}$ in $K$ (almost by definition), it follows that $K_{ai} = K$,
i.e. all parameters are algebraically identifiable. Finally, we observe that, by the action of
sw in Section 4.1, there are generically two possible values of $v_0 = \frac{1}{2}(E_{11} - E_{01})$ for a given
observed distribution, namely $\pm\sqrt{v}$. Hence $v_0 \notin K_{gi}$, and since a quadratic field extension
has no intermediate extensions, it follows that $K_{ri} = K_{gi}$, i.e. all generically identifiable
parameters are in fact rationally identifiable. In summary,

Proposition 6.5. For BHMM(n) where $n \geq 3$,

\[
\mathbb{C}(a, b, c, u, v) = K_{ri} = K_{gi} \subsetneq K_{ai} = \mathbb{C}(a_0, b, c_0, u, v_0)
\]



6.4 A new grading on BHMM invariants 

The re-parametrized model map $\psi_n$ is homogeneous in cumulant and moment coordinates,
with respect to a $\mathbb{Z}$-grading where $\deg(m_v) = \deg(k_v) = \operatorname{sum}(v)$, $\deg(b) = 0$, $\deg(a) =
\deg(c) = \deg(u) = 1$, and $\deg(v) = 2$. This grading allows for fast linear algebra techniques
that solve for low degree model invariants as in (Bray and Morton, 2005), except that this
grading is intrinsic to the model. Bray and Morton's grading, which is in probability
coordinates, is not on the binary HMM proper, but on a larger variety obtained by relaxing the
parameter constraints that the transition and emission matrix row sums are 1. The invariants
obtained in their search are hence invariants of this larger variety, and exclude some
invariants of BHMM(n). The grading presented here can thus be used to complete their search for
invariants up to any finite degree.
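As a small worked check of this grading, consider the image of $m_{12}$ from Proposition 4.6. Here $\deg(m_{12}) = \operatorname{sum}(110) = 2$, and each monomial on the right-hand side also has degree 2 under the weights $\deg(b) = 0$, $\deg(a) = \deg(c) = \deg(u) = 1$, $\deg(v) = 2$; the under-braces list the degrees of the factors in order:

```latex
\[
\psi_3^{\#}(m_{12}) =
\underbrace{abu}_{1+0+1} + \underbrace{ac}_{1+1} + \underbrace{au}_{1+1}
+ \underbrace{cu}_{1+1} + \underbrace{u^2}_{2 \cdot 1} + \underbrace{bv}_{0+2}
\]
```

The same bookkeeping applies to the other formulas of Proposition 4.6, for example every monomial in the image of $m_{123}$ has degree 3.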



6.5 Equilibrium BHM processes 

In Section 4.3 we found that if a BHM process is at equilibrium, our formula for $\psi_3^{-1}$ is
undefined. We may define Equilibrium Binary Hidden Markov Models, EBHMMs, by restricting
$\phi_n$ to the locus $\{a_0 b - a_0 + c_0 = 0\}$, which turns out to yield a four-dimensional submodel of
BHMM(n) for each $n \geq 3$. The same techniques used here to study BHMMs have revealed
that the EBHMMs, too, have birational parametrizations, and the ideal of EBHMM(3) is
generated by the equations $m_1 = m_2 = m_3$ and $m_{12} = m_{23}$. The geometry of EBHMMs will
need to be considered explicitly in future work to identify the learning coefficients of BHMM
fibers.



6.6 Larger hidden Markov models 

As we have remarked throughout, many results on BHMM(n) can be readily applied to
HMM(2, k, n), i.e. HMMs with two hidden states and $k$ visible states $\alpha_1, \ldots, \alpha_k$. For example,
consider the parameter identification problem. We may specify the process by a $2 \times k$ matrix
$E$ of emission probabilities, along with a triple $(a_0, b, c_0)$ defining the $\pi$ and $T$ of the two-state
hidden Markov chain as in (9). As in Section 2.3, to obtain $E_{0\ell}$ and $E_{1\ell}$ from the observed
probability distribution for any fixed $\ell$, we simply define a BHM process by letting $\alpha_\ell = 1$
and $\alpha_j = 0$ for $j \neq \ell$. Applying Proposition 4.7 to the moments of the distribution yields
values for $(a, b, c, u, v)$ provided the genericity condition that the denominators involved do
not vanish. Letting $v_0 = \sqrt{v}$, $a_0 = a/v_0$, and $c_0 = c/v_0$, we obtain $(a_0, b, c_0, u, v_0)$ up to a
simultaneous sign change on $(a_0, c_0, v_0)$ corresponding to swapping the hidden alphabet as in
Section 4.1. Then $E_{0\ell} = u - v_0$ and $E_{1\ell} = u + v_0$, and we get $\pi$, $T$ as well from $(a_0, b, c_0)$. We
can repeat this for each $\ell = 1, \ldots, k$ to obtain all the emission parameters, and hence identify
all the process parameters modulo the swapping operation.

For each $\ell \in \{1, \ldots, k\}$, we can also obtain many polynomial invariants of HMM(2, k, n)
by reducing to BHMM(n) as above, and marginalizing to collections of 4 equally spaced
visible nodes to obtain points of BHMM(4) at which we know the invariants of Theorem 3.1
will vanish.

Given these extensions, one can hope that techniques similar to those used here could
elucidate the algebraic statistics and geometry of HMMs with any number of hidden states
as well.